add AJGT dataset (huggingface#1078)

* add AJGT dataset
* Update datasets/ajgt_twitter_ar/ajgt_twitter_ar.py
* Update datasets/ajgt_twitter_ar/dataset_infos.json

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
sileod · Dec 7, 2020 · 4f8fb27
1 parent c8e64c1
commit 4f8fb27

Showing 4 changed files with 243 additions and 0 deletions.
diff --git a/datasets/ajgt_twitter_ar/README.md b/datasets/ajgt_twitter_ar/README.md
@@ -0,0 +1,139 @@
+---
+annotations_creators:
+- found
+language_creators:
+- found
+languages:
+- ar
+licenses:
+- unknown
+multilinguality:
+- monolingual
+size_categories:
+- 1k<n<10k
+source_datasets:
+- original
+task_categories:
+- text_classification
+task_ids:
+- sentiment-classification
+---
+
+# Dataset Card for AJGT
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Discussion of Social Impact and Biases](#discussion-of-social-impact-and-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+
+## Dataset Description
+
+- **Homepage:** [AJGT](https://github.com/komari6/Arabic-twitter-corpus-AJGT)
+- **Repository:** [AJGT](https://github.com/komari6/Arabic-twitter-corpus-AJGT)
+- **Paper:** [Arabic Tweets Sentimental Analysis Using Machine Learning](https://link.springer.com/chapter/10.1007/978-3-319-60042-0_66)
+- **Point of Contact:** [Khaled Alomari](mailto:khaled.alomari@adu.ac.ae)
+
+### Dataset Summary
+
+The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets annotated as positive or negative. The tweets are written in Modern Standard Arabic (MSA) or Jordanian dialect.
+
+### Supported Tasks and Leaderboards
+
+The dataset supports binary sentiment classification. It was introduced in this [paper](https://link.springer.com/chapter/10.1007/978-3-319-60042-0_66).
+
+### Languages
+
+The tweets are in Arabic, either Modern Standard Arabic (MSA) or Jordanian dialect.
+
+## Dataset Structure
+
+### Data Instances
+
+Each instance is a tweet labeled with a binary sentiment, negative or positive.
+
+### Data Fields
+
+Each example has two fields: `text` (a string holding the tweet) and `label` (a class label, `Negative` or `Positive`).
+
+### Data Splits
+
+The dataset is not split. 
+
+|           | Train  |
+|---------- | ------ | 
+|no split   | 1,800  | 
+
+## Dataset Creation
+
+### Curation Rationale
+
+[More Information Needed]
+
+### Source Data
+
+[More Information Needed]
+
+#### Initial Data Collection and Normalization
+
+Contains 1,800 tweets collected from Twitter.
+
+#### Who are the source language producers?
+
+The tweets were written by Twitter users.
+
+### Annotations
+
+The dataset does not contain any additional annotations.
+
+#### Annotation process
+
+[More Information Needed]
+
+#### Who are the annotators?
+
+[More Information Needed]
+
+### Personal and Sensitive Information
+
+[More Information Needed]
+
+## Considerations for Using the Data
+
+### Discussion of Social Impact and Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+[More Information Needed]
+
+### Licensing Information
+
+[More Information Needed]
+
+### Citation Information
+
+[More Information Needed]
diff --git a/datasets/ajgt_twitter_ar/ajgt_twitter_ar.py b/datasets/ajgt_twitter_ar/ajgt_twitter_ar.py
@@ -0,0 +1,103 @@
+# coding=utf-8
+# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Lint as: python3
+"""Arabic Jordanian General Tweets."""
+
+from __future__ import absolute_import, division, print_function
+
+import os
+
+import pandas as pd
+
+import datasets
+
+
+_DESCRIPTION = """\
+The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets \
+annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect.
+"""
+
+_CITATION = """\
+@inproceedings{alomari2017arabic,
+  title={Arabic tweets sentimental analysis using machine learning},
+  author={Alomari, Khaled Mohammad and ElSherif, Hatem M and Shaalan, Khaled},
+  booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},
+  pages={602--610},
+  year={2017},
+  organization={Springer}
+}
+"""
+
+_URL = "https://raw.githubusercontent.com/komari6/Arabic-twitter-corpus-AJGT/master/"
+
+
+class AjgtConfig(datasets.BuilderConfig):
+    """BuilderConfig for Ajgt."""
+
+    def __init__(self, **kwargs):
+        """BuilderConfig for Ajgt.
+
+        Args:
+          **kwargs: keyword arguments forwarded to super.
+        """
+        super(AjgtConfig, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs)
+
+
+class AjgtTwitterAr(datasets.GeneratorBasedBuilder):
+    """Ajgt dataset."""
+
+    BUILDER_CONFIGS = [
+        AjgtConfig(
+            name="plain_text",
+            description="Plain text",
+        )
+    ]
+
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "text": datasets.Value("string"),
+                    "label": datasets.features.ClassLabel(
+                        names=[
+                            "Negative",
+                            "Positive",
+                        ]
+                    ),
+                }
+            ),
+            supervised_keys=None,
+            homepage="https://github.com/komari6/Arabic-twitter-corpus-AJGT",
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        urls_to_download = {
+            "train": os.path.join(_URL, "AJGT.xlsx"),
+        }
+        downloaded_files = dl_manager.download(urls_to_download)
+        return [
+            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}),
+        ]
+
+    def _generate_examples(self, filepath):
+        """Generate examples."""
+        # Each spreadsheet row holds a tweet ("Feed") and its sentiment label ("Sentiment").
+        df = pd.read_excel(filepath)
+        for id_, record in df.iterrows():
+            tweet, sentiment = record["Feed"], record["Sentiment"]
+            yield str(id_), {"text": tweet, "label": sentiment}
diff --git a/datasets/ajgt_twitter_ar/dataset_infos.json b/datasets/ajgt_twitter_ar/dataset_infos.json
@@ -0,0 +1 @@
+{"plain_text": {"description": "Arabic Jordanian General Tweets (AJGT) Corpus consisted of 1,800 tweets annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect.\n", "citation": "@inproceedings{alomari2017arabic,\n  title={Arabic tweets sentimental analysis using machine learning},\n  author={Alomari, Khaled Mohammad and ElSherif, Hatem M and Shaalan, Khaled},\n  booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},\n  pages={602--610},\n  year={2017},\n  organization={Springer}\n}\n", "homepage": "https://github.com/komari6/Arabic-twitter-corpus-AJGT", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"num_classes": 2, "names": ["Negative", "Positive"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "builder_name": "ajgt_twitter_ar", "config_name": "plain_text", "version": {"version_str": "1.0.0", "description": "", "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 175424, "num_examples": 1800, "dataset_name": "ajgt_twitter_ar"}}, "download_checksums": {"https://raw.githubusercontent.com/komari6/Arabic-twitter-corpus-AJGT/master/AJGT.xlsx": {"num_bytes": 107395, "checksum": "966c52213872b6b8a3ced5fb7c60aee2abf47ca673c7d2c2eeb064a60bc9ed51"}}, "download_size": 107395, "post_processing_size": null, "dataset_size": 175424, "size_in_bytes": 282819}}
diff --git a/datasets/ajgt_twitter_ar/dummy/plain_text/1.0.0/dummy_data.zip b/datasets/ajgt_twitter_ar/dummy/plain_text/1.0.0/dummy_data.zip
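
For readers following the loader script above, here is a minimal, dependency-free sketch of what `_generate_examples` produces. It is not the code from this commit: the names `LABEL_NAMES`, `encode_label`, and `generate_examples` are hypothetical, plain dicts stand in for the rows of the spreadsheet, and the integer encoding mirrors how a class label with names `["Negative", "Positive"]` maps each name to its position in the list.

```python
# Hypothetical stand-in for the loader's example generation; not the commit's code.
# Rows are plain dicts with the same column names the script reads from AJGT.xlsx.

LABEL_NAMES = ["Negative", "Positive"]  # same order as the ClassLabel names in _info


def encode_label(name):
    """Map a label name to its integer id (its position in LABEL_NAMES)."""
    return LABEL_NAMES.index(name)  # raises ValueError for an unknown label


def generate_examples(rows):
    """Yield (key, example) pairs, one per row, keyed by the row position."""
    for id_, row in enumerate(rows):
        yield str(id_), {
            "text": row["Feed"],
            "label": encode_label(row["Sentiment"]),
        }


if __name__ == "__main__":
    # Placeholder rows for illustration only; these are not tweets from the corpus.
    sample_rows = [
        {"Feed": "first placeholder tweet", "Sentiment": "Positive"},
        {"Feed": "second placeholder tweet", "Sentiment": "Negative"},
    ]
    for key, example in generate_examples(sample_rows):
        print(key, example)
```

Keys are strings because the script converts the row index with `str(id_)`. An unknown sentiment value fails loudly here rather than being silently mapped to a class.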