AutoNLP TrainerBase and Text Classification #3728

Merged · 76 commits · Jan 5, 2023

Commits
303fa26
init commit; unit test pass
sijunhe Nov 10, 2022
a2d0461
ready for quick review
sijunhe Nov 11, 2022
f6cad1c
add types
sijunhe Nov 11, 2022
fc75035
isort,black,flake8
sijunhe Nov 11, 2022
ed39de8
mypy passes, other than paddle APIs
sijunhe Nov 11, 2022
57bd571
yapf training_args
sijunhe Nov 11, 2022
7cf2bdc
yapf
sijunhe Nov 11, 2022
a758ae8
remove afqmc
sijunhe Nov 11, 2022
0727aac
import error
sijunhe Nov 11, 2022
4af58cb
wip
sijunhe Nov 14, 2022
eb99889
ready for review
sijunhe Nov 14, 2022
c36732d
ready for revie
sijunhe Nov 14, 2022
d07ccf1
yapf
sijunhe Nov 14, 2022
c8e15ba
Merge branch 'develop' into autonlp_trainer
sijunhe Nov 14, 2022
e8b8a06
implement predict, export, show_training_results API
sijunhe Nov 15, 2022
53e44de
Merge branch 'autonlp_trainer' of https://github.com/PaddlePaddle/Pad…
sijunhe Nov 15, 2022
cdec5f3
styles
sijunhe Nov 15, 2022
3a335da
add classification metrics
sijunhe Nov 15, 2022
13d27ab
styles and docstring
sijunhe Nov 15, 2022
38a87a5
merge with develop and resolve
sijunhe Nov 18, 2022
8c5c829
pre-commit
sijunhe Nov 18, 2022
3f7ef0e
merge with master
sijunhe Nov 24, 2022
b07e7e9
to taskflow works
sijunhe Nov 24, 2022
4190cbb
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Nov 29, 2022
8826b36
fix styles
sijunhe Nov 29, 2022
17dea29
implemented multilabel classification
sijunhe Nov 30, 2022
0ff1e18
config -> candidates
sijunhe Dec 2, 2022
001d7cb
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 5, 2022
a34554f
wip
sijunhe Dec 5, 2022
0607add
merge
sijunhe Dec 5, 2022
f87a48e
wip
sijunhe Dec 6, 2022
aa56171
prompt trainer works
sijunhe Dec 6, 2022
34c9c30
merging
sijunhe Dec 13, 2022
d39fd65
modify export implementations
sijunhe Dec 13, 2022
f800741
control model file size
sijunhe Dec 14, 2022
da72635
wip
sijunhe Dec 16, 2022
0764ebc
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 16, 2022
1d92cce
tests work
sijunhe Dec 16, 2022
9475dbe
remove trainer
sijunhe Dec 16, 2022
efdfb10
multi-label tests
sijunhe Dec 16, 2022
30e0c30
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 16, 2022
696d5a9
add tests
sijunhe Dec 16, 2022
0b0f61d
bump paddlepaddle
sijunhe Dec 16, 2022
57b5811
change to 2.4.0rc0
sijunhe Dec 16, 2022
2aa0f0c
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 19, 2022
63f84bd
Merge branch 'develop' into autonlp_trainer
sijunhe Dec 19, 2022
a6a5f34
wip
sijunhe Dec 20, 2022
79a457a
tiny random bert to speed up unit test
sijunhe Dec 21, 2022
2782120
Merge branch 'autonlp_trainer' of https://github.com/PaddlePaddle/Pad…
sijunhe Dec 21, 2022
06eb14e
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 21, 2022
0fb2080
Merge branch 'develop' into autonlp_trainer
sijunhe Dec 21, 2022
00edf8f
use local_model for tests
sijunhe Dec 21, 2022
b17f5b4
Merge branch 'autonlp_trainer' of https://github.com/PaddlePaddle/Pad…
sijunhe Dec 21, 2022
5a5dc4e
merging
sijunhe Dec 27, 2022
56f9f14
wip
sijunhe Dec 27, 2022
4252a4a
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 27, 2022
beb41e6
wip
sijunhe Dec 27, 2022
dbe0e80
changes
sijunhe Dec 28, 2022
518e0c7
remove missing fn
sijunhe Dec 29, 2022
4b19cdc
redesigned overrides and custom model candidates
sijunhe Dec 29, 2022
36461c7
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Dec 29, 2022
e49a048
test
sijunhe Dec 30, 2022
a8beae3
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Jan 3, 2023
a35405d
update api
sijunhe Jan 3, 2023
97d7aae
evaluate works
sijunhe Jan 3, 2023
9ad445b
readme
sijunhe Jan 3, 2023
8778d1b
add chinese readme
sijunhe Jan 3, 2023
65710e9
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Jan 3, 2023
8b9f0e6
error type
sijunhe Jan 3, 2023
9fba845
add verbosity
sijunhe Jan 3, 2023
3717303
add verbosity
sijunhe Jan 3, 2023
54a83af
verbosity fix
sijunhe Jan 3, 2023
a619f3d
set log level
sijunhe Jan 3, 2023
dc26101
address Zeyu's comment
sijunhe Jan 3, 2023
a6ea660
address Zeyu's comment
sijunhe Jan 3, 2023
48d5763
Merge remote-tracking branch 'origin/develop' into autonlp_trainer
sijunhe Jan 4, 2023
1 change: 1 addition & 0 deletions Makefile
@@ -45,6 +45,7 @@ unit-test:
install:
pip install -r requirements-dev.txt
pip install -r requirements.txt
pip install -r paddlenlp/experimental/autonlp/requirements.txt
> Comment (sijunhe, Collaborator Author): Strictly speaking this should not go into `make install`, because most development does not need AutoNLP's dependencies; it is added for now so the unit tests can run.

pre-commit install


146 changes: 146 additions & 0 deletions paddlenlp/experimental/autonlp/README.md
@@ -0,0 +1,146 @@
# AutoNLP

**简体中文**🀄 | [English🌎](./README_en.md)

## Introduction

**AutoNLP is currently experimental. The AutoNLP APIs may change before the formal release.**

**AutoNLP** is an early-stage, experimental PaddleNLP project that aims to bring NLP technology to every industry. Delivering a successful NLP project is not easy, because it requires deep NLP domain knowledge, and we frequently see developers struggle while applying NLP. That is why we are building **AutoNLP**. Compared with the traditional AutoML approach of spending massive compute to chase state-of-the-art model accuracy, we have a different philosophy:

1. Rather than training state-of-the-art models on large clusters and large datasets, our goal is to train **decent models under limited compute**. We assume our users have at most a few GPUs and want to train a decent model within 8 hours. You can get this level of compute for free on [Baidu AI Studio](https://aistudio.baidu.com/aistudio).
2. AutoNLP aims to be a **low-code** solution that lets you train decent models with a few lines of code, but it is not a no-code model-training service.
3. We will **automate and abstract away** as much as possible of PaddleNLP's existing **full-pipeline capabilities** (e.g. preprocessing, tokenization, fine-tuning, prompt tuning, model compression, one-click deployment), helping developers adapt quickly to their own use cases and ship.
4. Our work is **free and open source**.

## Installation

Installing **AutoNLP** is very similar to installing PaddleNLP; the only difference is the added `[autonlp]` tag.

```
pip install -U paddlenlp[autonlp]
```

You can also get the latest work in the develop branch by cloning from our [GitHub](https://github.com/PaddlePaddle/PaddleNLP) and installing from source via `pip install .[autonlp]`.

## Basic Usage

Since the only task AutoNLP currently supports is text classification, the following documentation covers the usage of **AutoTrainerForTextClassification**. You can also refer to our AI Studio notebook (To be added).

### Constructing an AutoTrainerForTextClassification

`AutoTrainerForTextClassification` is the main class you use to run model experiments and interact with the trained models. You can construct it like the following:

```python
auto_trainer = AutoTrainerForTextClassification(
    train_dataset=train_ds,
    # Review (Collaborator): On the input side here, are we passing the `datasets`
    # concept on to users?
    # Reply (sijunhe, Jan 4, 2023): Yes, that is the current design; users convert
    # their data into the datasets format themselves. The Trainer requires Datasets
    # anyway, so this dovetails with it. The MVP should also stay lightweight and
    # avoid non-core features; if there is demand later, we can support conversion
    # from things like pd.DataFrame.
    eval_dataset=dev_ds,
    label_column="labels",
    text_column="sentence",
    # Review (Collaborator): For text classification tasks whose input has two
    # columns, is this compatible?
    # Reply (sijunhe): Deliberately not compatible; there will be a dedicated
    # AutoTrainerForSemanticSearch later.
    language="Chinese",
    output_dir="temp"
)
```

Args:

- train_dataset (Dataset, required): training dataset in `paddle.io.Dataset` format; must contain the `text_column` and `label_column` specified below
- eval_dataset (Dataset, required): evaluation dataset in `paddle.io.Dataset` format; must contain the `text_column` and `label_column` specified below
- text_column (string, required): the text field in the dataset, which is the model's main input
- label_column (string, required): the label field in the dataset
- language (string, required): the language of the text
- metric_for_best_model (string, optional): the evaluation metric used to select the best model
- greater_is_better (bool, optional): whether a better model should have a greater metric; used together with `metric_for_best_model`
- problem_type (str, optional): choose from [`multi_class`, `multi_label`] depending on the nature of the problem
- output_dir (str, optional): output directory, defaults to `autonlp_results`
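As a rough illustration of the record shape the trainer expects, each dataset item is a dict keyed by the configured `text_column` and `label_column`. The `InMemoryDataset` helper below is hypothetical (in practice you would subclass `paddle.io.Dataset`); the `sentence`/`labels` names follow the constructor example above:

```python
# Hypothetical stand-in for a paddle.io.Dataset subclass: each record is a dict
# holding the configured text_column ("sentence") and label_column ("labels").
class InMemoryDataset:
    def __init__(self, records):
        self.records = list(records)

    def __getitem__(self, idx):
        return self.records[idx]

    def __len__(self):
        return len(self.records)


train_ds = InMemoryDataset(
    [
        {"sentence": "这是一条正面评论", "labels": "positive"},
        {"sentence": "这是一条负面评论", "labels": "negative"},
    ]
)
assert set(train_ds[0]) == {"sentence", "labels"}
```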

### Training

You can start training a model with the following command:

```python
auto_trainer.train(
    num_cpus=2,
    num_gpus=1,
    max_concurrent_trials=1,
    num_models=10,
    time_budget_s=60 * 10,
    verbosity=1
)
```
Args:

- num_models (int, required): number of model trials to run
- num_gpus (int, optional): number of GPUs to use for the experiments. By default, this is set based on detected GPUs.
- num_cpus (int, optional): number of CPUs to use for the experiments. By default, this is set based on detected vCPUs.
> Review (Collaborator): vCPU -> CPU
>
> Reply (sijunhe, Jan 4, 2023): It really is vCPU here; the English original is "virtual core", which comes from Ray, the underlying library.

- max_concurrent_trials (int, optional): maximum number of trials to run concurrently. Must be non-negative. If None or 0, no limit is applied. Defaults to None.
- time_budget_s: (int|float|datetime.timedelta, optional): global time budget in seconds, after which all model trials are stopped.
> Review (Collaborator): There is no need to offer so many types for the time limit; int and float alone would do.
>
> Reply (sijunhe): This comes from the underlying Ray configuration; since Ray already supports it, I am happy to expose it.

- experiment_name: (str, optional): name of the experiment. Experiment logs are stored under `<output_dir>/<experiment_name>`. Defaults to the UNIX timestamp.
- verbosity: (int, optional): controls the verbosity of the logs. Defaults to `0`, which sets the logger level to INFO. To reduce the amount of logging, use `verbosity > 0` to set the logger level to WARNING.
- hp_overrides: (dict[str, Any], optional): (advanced users only) override the hyperparameters of every candidate model, e.g. `{"TrainingArguments.max_steps": 5}`.
- custom_model_candidates: (dict[str, Any], optional): (advanced users only) run user-provided model candidates instead of PaddleNLP's default candidates. See the `._model_candidates` property for reference.


> Review (Collaborator): Still a question here: AutoNLP does not seem to expose a single unified interface; instead, each task has its own. What is the thinking behind this?
>
> Reply (sijunhe, Jan 4, 2023): There will be one later. The current MVP has only the one classification class, so it is not set up yet. Later we will support something like Taskflow, with an added task_type.

### Evaluating and Examining Experiment Results

#### Examining Experiment Results

Once the experiments conclude, you can examine the experiment results like the following, which prints a pandas DataFrame:

```
auto_trainer.show_training_results()
```

You can also find the experiment results under `<output_dir>/experiment_results.csv`. The identifier for the models produced by different experiments is `trial_id`, a field you can find in the DataFrame or the csv file.
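For example, one might load that results file with pandas and pick out the best `trial_id`. This is only a sketch: the column name `eval_accuracy` and the sample values are assumptions for illustration; check the actual CSV header in your run.

```python
import io

import pandas as pd

# Illustrative stand-in for <output_dir>/experiment_results.csv; the
# "eval_accuracy" column name and the values are assumed for this sketch.
csv_text = """trial_id,eval_accuracy
trial_000001,0.87
trial_000002,0.91
trial_000003,0.89
"""

results = pd.read_csv(io.StringIO(csv_text))
# Sort by the metric (descending) and take the top row's identifier.
best_trial_id = results.sort_values("eval_accuracy", ascending=False).iloc[0]["trial_id"]
print(best_trial_id)  # trial_000002
```

The `trial_id` picked this way can then be passed to `evaluate`, `export`, or `to_taskflow` as shown in the sections that follow.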

#### Loading Previous Experiment Results

You can recover the experiment results from a previous run (including unfinished runs) like the following:

```python
auto_trainer.load("path/to/previous/results")
```

This enables you to examine the results with the `show_training_results` API. Calling `train()` again will overwrite the previous results.

#### Evaluating on a Different Dataset

Besides the evaluation dataset provided when constructing `AutoTrainerForTextClassification`, you can also evaluate on other datasets:

```
auto_trainer.evaluate(
    trial_id="trial_123456",
    eval_dataset=new_eval_dataset
)
```

Args:
- trial_id (str, optional): specify the model to evaluate via its `trial_id`. Defaults to the best model as determined by `metric_for_best_model`
- eval_dataset (Dataset, optional): a custom evaluation dataset, which must contain the `text_column` and `label_column` fields. If not provided, defaults to the evaluation dataset used at construction



### Model Export and Deployment

If you need to export a model for later use, you can use the following API:

```
auto_trainer.export(
    trial_id="trial_123456",
    export_path="different/path/to/store/the/model"
)
```

Args:
- export_path (str, required): the export path
- trial_id (str, optional): specify the model to export via its `trial_id`. Defaults to the best model as determined by `metric_for_best_model`

We also provide a `to_taskflow()` API that converts the model directly into a `Taskflow` for inference:

```
taskflow = auto_trainer.to_taskflow()
taskflow("this is a test input")
```

Args:
- trial_id (str, optional): specify the model to convert via its `trial_id`. Defaults to the best model as determined by `metric_for_best_model`
146 changes: 146 additions & 0 deletions paddlenlp/experimental/autonlp/README_en.md
@@ -0,0 +1,146 @@
# AutoNLP

[简体中文🀄](./README_cn.md) | **English**🌎

## Introduction

**The AutoNLP APIs are subject to significant changes until formal release.**

**AutoNLP** is an experimental project by PaddleNLP to democratize NLP for everyone. Delivering a successful NLP project is not easy, as it requires deep domain knowledge. Time after time, we have seen people struggle to make NLP work on their datasets, for their projects, which is why we are building **AutoNLP**. Compared with the traditional AutoML approach of massive paid compute for state-of-the-art model performance, we have a different philosophy:


1. Instead of training State-of-the-Art models on huge datasets running on huge clusters, our goal is to deliver **decent models under limited compute**. We assume our users have a few GPUs at most and want to get decent models under 8 hours on their own in-house datasets. Note that you can get this level of compute for FREE on [Baidu AI Studio](https://aistudio.baidu.com/aistudio).
2. Our solution is **low-code**, enabling you to train good models with a few lines of code, but it will not be a no-code / drag-and-drop service.
3. Leveraging the **full-cycle capabilities** of PaddleNLP, we intend to **automate and abstract away** as much of NLP as possible, ranging from preprocessing to tokenization, from fine-tuning to prompt tuning, and from model compression to deployment.
4. Our work is and always will be **free and open-sourced**.

## Installation

Installing **AutoNLP** is very similar to installing PaddleNLP, with the only difference being the `[autonlp]` tag.

```
pip install -U paddlenlp[autonlp]
```

You can also get our latest work in the develop branch by cloning from our [GitHub](https://github.com/PaddlePaddle/PaddleNLP) and install from source via `pip install .[autonlp]`.

## Basic Usage

Since the only supported task for now is text classification, the following documentation covers the usage of **AutoTrainerForTextClassification**. You can also follow our AI Studio notebook as an example.

### Constructor

`AutoTrainerForTextClassification` is the main class you use to run model experiments and interact with the trained models. You can construct it like the following:

```python
auto_trainer = AutoTrainerForTextClassification(
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    label_column="labels",
    text_column="sentence",
    language="Chinese",
    output_dir="temp"
)
```

Args:

- train_dataset (Dataset, required): training dataset in the format of `paddle.io.Dataset`; must contain the `text_column` and `label_column` specified below
- eval_dataset (Dataset, required): evaluation dataset in the format of `paddle.io.Dataset`; must contain the `text_column` and `label_column` specified below
- text_column (string, required): name of the column that contains the input text.
- label_column (string, required): name of the column that contains the target variable to predict.
- language (string, required): language of the text
- metric_for_best_model (string, optional): the name of the metric for selecting the best model.
- greater_is_better (bool, optional): whether better models should have a greater metric or not. Used in conjunction with `metric_for_best_model`.
- problem_type (str, optional): select among ["multi_class", "multi_label"] based on the nature of your problem
- output_dir (str, optional): output directory for the experiments, defaults to "autonlp_results"

### Train

You can start training with the following command:

```python
auto_trainer.train(
    num_cpus=2,
    num_gpus=1,
    max_concurrent_trials=1,
    num_models=10,
    time_budget_s=60 * 10,
    verbosity=1
)
```
Args:

- num_models (int, required): number of model trials to run
- num_gpus (int, optional): number of GPUs to use for the job. By default, this is set based on detected GPUs.
- num_cpus (int, optional): number of CPUs to use for the job. By default, this is set based on detected virtual cores.
- max_concurrent_trials (int, optional): maximum number of trials to run concurrently. Must be non-negative. If None or 0, no limit is applied. Defaults to None.
- time_budget_s: (int|float|datetime.timedelta, optional): global time budget in seconds, after which all model trials are stopped.
- experiment_name: (str, optional): name of the experiment. The experiment log is stored under `<output_dir>/<experiment_name>`. Defaults to the UNIX timestamp.
- hp_overrides: (dict[str, Any], optional): advanced users only. Override the hyperparameters of every model candidate, e.g. `{"TrainingArguments.max_steps": 5}`.
- custom_model_candidates: (dict[str, Any], optional): advanced users only. Run the user-provided model candidates instead of the default candidates from PaddleNLP. See the `._model_candidates` property as an example.
- verbosity: (int, optional): controls the verbosity of the logger. Defaults to `0`, which sets the logger level to INFO. To reduce the amount of logs, use `verbosity > 0` to set the logger level to WARNING.
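As a rough sketch of how a dotted `hp_overrides` key such as `"TrainingArguments.max_steps"` could map onto a nested candidate configuration. This illustrates the naming convention only; the `apply_hp_overrides` helper and the candidate dict layout are assumptions, not AutoNLP's actual implementation:

```python
# Illustrative only: apply {"Section.param": value}-style overrides to a nested
# candidate config without mutating the original.
def apply_hp_overrides(candidate, overrides):
    patched = {section: dict(params) for section, params in candidate.items()}
    for dotted_key, value in overrides.items():
        # "TrainingArguments.max_steps" -> section "TrainingArguments", param "max_steps"
        section, param = dotted_key.split(".", 1)
        patched.setdefault(section, {})[param] = value
    return patched


candidate = {"TrainingArguments": {"max_steps": 1000, "learning_rate": 3e-5}}
patched = apply_hp_overrides(candidate, {"TrainingArguments.max_steps": 5})
print(patched["TrainingArguments"]["max_steps"])  # 5
```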

### Evaluation and Examining Results

#### Examine Results

Once the experiments conclude, you can examine the experiment results like the following, which prints a pandas DataFrame:

```
auto_trainer.show_training_results()
```

You can also find the experiment results under `<output_dir>/experiment_results.csv`. The identifier for the models produced by different experiments is `trial_id`, which you can find in the `DataFrame` or the csv file.

#### Load Previous Results

You can recover the experiment results from a previous run (including unfinished runs) like the following:

```python
auto_trainer.load("path/to/previous/results")
```

This enables you to use the `show_training_results` API to examine the results. Calling `train()` again will overwrite the previous results.

#### Custom Evaluations

To evaluate on datasets other than the evaluation dataset provided to `AutoTrainerForTextClassification` at construction, you can use the `evaluate` API:

```
auto_trainer.evaluate(
    trial_id="trial_123456",
    eval_dataset=new_eval_dataset
)
```

Args:
- trial_id (str, optional): specify the model to be evaluated through the `trial_id`. Defaults to the best model, ranked by `metric_for_best_model`
- eval_dataset (Dataset, optional): a custom evaluation dataset, which must contain the `text_column` and `label_column` fields. If not provided, defaults to the evaluation dataset used at construction



### Export and Inference

To export a model for later use, do:

```
auto_trainer.export(
    trial_id="trial_123456",
    export_path="different/path/to/store/the/model"
)
```

Args:
- export_path (str, required): the filepath for export
- trial_id (str, optional): use the `trial_id` to select the model to export. Defaults to the best model selected by `metric_for_best_model`

We also provide a convenience method to directly convert a model to a Taskflow for inference:

```
taskflow = auto_trainer.to_taskflow()
taskflow("this is a test input")
```

Args:
- trial_id (str, optional): use the `trial_id` to select the model to convert. Defaults to the best model selected by `metric_for_best_model`
15 changes: 15 additions & 0 deletions paddlenlp/experimental/autonlp/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# flake8: noqa
from .text_classification import *