add AJGT dataset (huggingface#1078)
* add AJGT dataset

* Update datasets/ajgt_twitter_ar/ajgt_twitter_ar.py

* Update datasets/ajgt_twitter_ar/dataset_infos.json

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
2 people authored and sileod committed Dec 7, 2020
1 parent c8e64c1 commit 4f8fb27
Showing 4 changed files with 243 additions and 0 deletions.
139 changes: 139 additions & 0 deletions datasets/ajgt_twitter_ar/README.md
@@ -0,0 +1,139 @@
---
annotations_creators:
- found
language_creators:
- found
languages:
- ar
licenses:
- unknown
multilinguality:
- monolingual
size_categories:
- 1k<n<10k
source_datasets:
- original
task_categories:
- text_classification
task_ids:
- sentiment-classification
---

# Dataset Card for AJGT

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Discussion of Social Impact and Biases](#discussion-of-social-impact-and-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [AJGT](https://github.com/komari6/Arabic-twitter-corpus-AJGT)
- **Repository:** [AJGT](https://github.com/komari6/Arabic-twitter-corpus-AJGT)
- **Paper:** [Arabic Tweets Sentimental Analysis Using Machine Learning](https://link.springer.com/chapter/10.1007/978-3-319-60042-0_66)
- **Point of Contact:** [Khaled Alomari](mailto:khaled.alomari@adu.ac.ae)

### Dataset Summary

The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets, each annotated as positive or negative. The tweets are written in Modern Standard Arabic (MSA) or the Jordanian dialect.

### Supported Tasks and Leaderboards

The dataset supports binary sentiment classification. It was introduced in this [paper](https://link.springer.com/chapter/10.1007/978-3-319-60042-0_66).

### Languages

The tweets are in Arabic, either Modern Standard Arabic (MSA) or the Jordanian dialect.

## Dataset Structure

### Data Instances

A binary dataset: each instance is a tweet labeled with negative or positive sentiment.

### Data Fields

- `text`: the tweet text (string).
- `label`: the sentiment class, either `Negative` or `Positive`.
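
As a minimal sketch of the label encoding, assume the class names `Negative` and `Positive` declared in the loading script map to the ids 0 and 1, in that order. The snippet uses plain Python rather than the `datasets` API; the helper names `str2int` and `int2str` and the sample record are hypothetical, and the text is a placeholder, not a tweet from the corpus.

```python
# Class names as declared in the loading script; ids follow list order.
LABEL_NAMES = ["Negative", "Positive"]


def str2int(name: str) -> int:
    """Map a class name to its integer id (hypothetical helper)."""
    return LABEL_NAMES.index(name)


def int2str(label_id: int) -> str:
    """Map an integer id back to its class name (hypothetical helper)."""
    return LABEL_NAMES[label_id]


# Hypothetical record: the text is a placeholder, not taken from the corpus.
example = {"text": "<tweet text>", "label": str2int("Positive")}
print(example["label"], int2str(example["label"]))
```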

### Data Splits

The dataset is not split.

|          | Train |
|----------|-------|
| no split | 1,800 |
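
Because the corpus ships as a single split, anyone who needs a held-out set has to create one. A minimal sketch in plain Python, assuming an 80/20 split and a fixed seed (both arbitrary choices, not prescribed by the dataset authors); `split_indices` is a hypothetical helper:

```python
import random


def split_indices(n_examples, test_fraction=0.2, seed=42):
    """Shuffle the indices 0..n_examples-1 and cut them into train and test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(n_examples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1800)
print(len(train_idx), len(test_idx))
```

When working through the `datasets` library, `Dataset.train_test_split` may be a more convenient way to get the same effect.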

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

The corpus contains 1,800 tweets collected from Twitter.

#### Who are the source language producers?

Twitter users.

### Annotations

The dataset does not contain any additional annotations.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Discussion of Social Impact and Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]
103 changes: 103 additions & 0 deletions datasets/ajgt_twitter_ar/ajgt_twitter_ar.py
@@ -0,0 +1,103 @@
# coding=utf-8
# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lint as: python3
"""Arabic Jordanian General Tweets."""

from __future__ import absolute_import, division, print_function

import os

import pandas as pd

import datasets


_DESCRIPTION = """\
Arabic Jordanian General Tweets (AJGT) Corpus consisted of 1,800 tweets \
annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect.
"""

_CITATION = """\
@inproceedings{alomari2017arabic,
title={Arabic tweets sentimental analysis using machine learning},
author={Alomari, Khaled Mohammad and ElSherif, Hatem M and Shaalan, Khaled},
booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},
pages={602--610},
year={2017},
organization={Springer}
}
"""

_URL = "https://raw.githubusercontent.com/komari6/Arabic-twitter-corpus-AJGT/master/"


class AjgtConfig(datasets.BuilderConfig):
    """BuilderConfig for Ajgt."""

    def __init__(self, **kwargs):
        """BuilderConfig for Ajgt.
        Args:
          **kwargs: keyword arguments forwarded to super.
        """
        super(AjgtConfig, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs)


class AjgtTwitterAr(datasets.GeneratorBasedBuilder):
    """Ajgt dataset."""

    BUILDER_CONFIGS = [
        AjgtConfig(
            name="plain_text",
            description="Plain text",
        )
    ]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.features.ClassLabel(
                        names=[
                            "Negative",
                            "Positive",
                        ]
                    ),
                }
            ),
            supervised_keys=None,
            homepage="https://github.com/komari6/Arabic-twitter-corpus-AJGT",
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        urls_to_download = {
            "train": os.path.join(_URL, "AJGT.xlsx"),
        }
        downloaded_files = dl_manager.download(urls_to_download)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}),
        ]

    def _generate_examples(self, filepath):
        """Generate examples."""
        # Each spreadsheet row holds one tweet ("Feed") and its sentiment label ("Sentiment").
        df = pd.read_excel(filepath)
        for id_, record in df.iterrows():
            tweet, sentiment = record["Feed"], record["Sentiment"]
            yield str(id_), {"text": tweet, "label": sentiment}
1 change: 1 addition & 0 deletions datasets/ajgt_twitter_ar/dataset_infos.json
@@ -0,0 +1 @@
{"plain_text": {"description": "Arabic Jordanian General Tweets (AJGT) Corpus consisted of 1,800 tweets annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect.\n", "citation": "@inproceedings{alomari2017arabic,\n title={Arabic tweets sentimental analysis using machine learning},\n author={Alomari, Khaled Mohammad and ElSherif, Hatem M and Shaalan, Khaled},\n booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},\n pages={602--610},\n year={2017},\n organization={Springer}\n}\n", "homepage": "https://github.com/komari6/Arabic-twitter-corpus-AJGT", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"num_classes": 2, "names": ["Negative", "Positive"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "builder_name": "ajgt_twitter_ar", "config_name": "plain_text", "version": {"version_str": "1.0.0", "description": "", "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 175424, "num_examples": 1800, "dataset_name": "ajgt_twitter_ar"}}, "download_checksums": {"https://raw.githubusercontent.com/komari6/Arabic-twitter-corpus-AJGT/master/AJGT.xlsx": {"num_bytes": 107395, "checksum": "966c52213872b6b8a3ced5fb7c60aee2abf47ca673c7d2c2eeb064a60bc9ed51"}}, "download_size": 107395, "post_processing_size": null, "dataset_size": 175424, "size_in_bytes": 282819}}
Binary file not shown.
