PyABSA - Open Framework for Aspect-based Sentiment Analysis


Hi, there! Please star this repo if it helps you! Each Star helps PyABSA go further, many thanks.

To augment your datasets, please refer to BoostTextAugmentation

Auto-annotate your datasets via PyABSA!

An experimental feature lets you auto-build APC and ATEPC datasets. See the usage here:

from pyabsa import make_ABSA_dataset 

# refer to the comments in this function for detailed usage
make_ABSA_dataset(dataset_name_or_path='review', checkpoint='english')

Public and Community-shared ABSADatasets

More datasets are available at ABSADatasets.

Annotate Your Own Dataset

The ABSADatasets/DPT repo provides an open-source dataset annotation tool, so you can easily annotate your dataset before using PyABSA.

Important: Rename your dataset files before using them in PyABSA

Although the integrated datasets have no ids, it is recommended to assign an id to your dataset. When merging your datasets into ABSADatasets, please keep the id unchanged.

  • APC dataset name should be {id}.{dataset name}, and the dataset files should be named {dataset name}.{type}.dat, e.g.,
datasets
├── 101.restaurant
│    ├── restaurant.train.dat  # train_dataset
│    ├── restaurant.test.dat  # test_dataset
│    └── restaurant.valid.dat  # valid_dataset; dev sets are not recognized in PyABSA, please rename dev-set to valid-set
└── others
  • ATEPC dataset name should be {id}.{dataset name}, and the dataset files should be named {dataset name}.{type}.dat.atepc, e.g.,
datasets
├── 101.restaurant
│    ├── restaurant.train.dat.atepc  # train_dataset
│    ├── restaurant.test.dat.atepc  # test_dataset
│    └── restaurant.valid.dat.atepc  # valid_dataset; dev sets are not recognized in PyABSA, please rename dev-set to valid-set
└── others
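
The naming rules above can be checked with a small script before training. The following is a minimal sketch using only the Python standard library; it is not part of PyABSA, and the helper names `expected_files` and `rename_dev_to_valid` are hypothetical. It assumes the directory name has the form {id}.{dataset name} and that the id itself contains no dot.

```python
from pathlib import Path
from typing import List

# Split names shown in the directory trees above.
SPLITS = ("train", "test", "valid")


def expected_files(dataset_dir_name: str, task: str = "apc") -> List[str]:
    """Return the file names expected inside a '{id}.{dataset name}' directory.

    Assumption: the id contains no '.', so the first '.' separates id and name.
    """
    dataset_id, sep, name = dataset_dir_name.partition(".")
    if not sep or not dataset_id or not name:
        raise ValueError(f"expected '{{id}}.{{dataset name}}', got {dataset_dir_name!r}")
    suffix = ".dat.atepc" if task == "atepc" else ".dat"
    return [f"{name}.{split}{suffix}" for split in SPLITS]


def rename_dev_to_valid(dataset_dir: Path) -> List[Path]:
    """Rename '*.dev.dat*' files to '*.valid.dat*' and return the new paths."""
    renamed = []
    for path in sorted(dataset_dir.iterdir()):
        if path.is_file() and ".dev.dat" in path.name:
            target = path.with_name(path.name.replace(".dev.dat", ".valid.dat"))
            if target.exists():
                # Refuse to overwrite an existing valid-set file.
                raise FileExistsError(f"{target} already exists")
            path.rename(target)
            renamed.append(target)
    return renamed
```

For example, `expected_files("101.restaurant", task="atepc")` lists the three `.dat.atepc` file names from the ATEPC tree above, and `rename_dev_to_valid(Path("datasets/101.restaurant"))` renames any dev-set file in that directory in place.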

Fit on Your Existing Dataset

  • First, refer to ABSADatasets to convert your dataset into an accepted format.
  • You can open a PR to contribute your dataset and then use it like ABSADatasets.your_dataset (all datasets are for research use only, so this should not endanger your data copyright).

Use Human-readable Labels in Your Dataset

PyABSA encourages you to use string labels instead of numbers, e.g., sentiment labels = {negative, positive, neutral, unknown}.

  • The labels you use in the dataset are the labels that will be output at inference.
  • You can train a model on multiple datasets that share the same sentiment labels, and you can even contribute and define a combination of datasets here!
  • PyABSA's version information is also shown in the output when a checkpoint's training args are loaded.
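
If your existing annotations use numeric labels, converting them to strings is a one-off preprocessing step. The sketch below is illustrative only and is not a PyABSA API: the helper `to_readable_labels` is hypothetical, and the numeric encoding in `NUMERIC_TO_READABLE` (-1, 0, 1) is an assumption that you should replace with whatever encoding your own dataset uses.

```python
from typing import Dict, Iterable, List

# Assumed encoding for illustration; adjust to match your own dataset.
NUMERIC_TO_READABLE: Dict[str, str] = {
    "-1": "negative",
    "0": "neutral",
    "1": "positive",
}


def to_readable_labels(
    labels: Iterable[object],
    mapping: Dict[str, str] = NUMERIC_TO_READABLE,
) -> List[str]:
    """Map numeric labels to string labels; leave unmapped labels unchanged."""
    readable = []
    for label in labels:
        key = str(label).strip()
        readable.append(mapping.get(key, key))
    return readable
```

Labels that are already strings pass through untouched, so the function can be run safely over a dataset that mixes both styles.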