forked from huggingface/datasets
Commit
* add AJGT dataset
* Update datasets/ajgt_twitter_ar/ajgt_twitter_ar.py
* Update datasets/ajgt_twitter_ar/dataset_infos.json

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Showing 4 changed files with 243 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
--- | ||
annotations_creators: | ||
- found | ||
language_creators: | ||
- found | ||
languages: | ||
- ar | ||
licenses: | ||
- unknown | ||
multilinguality: | ||
- monolingual | ||
size_categories: | ||
- 1k<n<10k | ||
source_datasets: | ||
- original | ||
task_categories: | ||
- text_classification | ||
task_ids: | ||
- sentiment-classification | ||
--- | ||
|
||
# Dataset Card for AJGT | ||
|
||
## Table of Contents | ||
- [Dataset Description](#dataset-description) | ||
- [Dataset Summary](#dataset-summary) | ||
- [Supported Tasks](#supported-tasks-and-leaderboards) | ||
- [Languages](#languages) | ||
- [Dataset Structure](#dataset-structure) | ||
- [Data Instances](#data-instances) | ||
- [Data Fields](#data-fields) | ||
- [Data Splits](#data-splits) | ||
- [Dataset Creation](#dataset-creation) | ||
- [Curation Rationale](#curation-rationale) | ||
- [Source Data](#source-data) | ||
- [Annotations](#annotations) | ||
- [Personal and Sensitive Information](#personal-and-sensitive-information) | ||
- [Considerations for Using the Data](#considerations-for-using-the-data) | ||
- [Discussion of Social Impact and Biases](#discussion-of-social-impact-and-biases) | ||
- [Other Known Limitations](#other-known-limitations) | ||
- [Additional Information](#additional-information) | ||
- [Dataset Curators](#dataset-curators) | ||
- [Licensing Information](#licensing-information) | ||
- [Citation Information](#citation-information) | ||
|
||
## Dataset Description | ||
|
||
- **Homepage:** [AJGT](https://github.com/komari6/Arabic-twitter-corpus-AJGT) | ||
- **Repository:** [AJGT](https://github.com/komari6/Arabic-twitter-corpus-AJGT) | ||
- **Paper:** [Arabic Tweets Sentimental Analysis Using Machine Learning](https://link.springer.com/chapter/10.1007/978-3-319-60042-0_66) | ||
- **Point of Contact:** [Khaled Alomari](mailto:khaled.alomari@adu.ac.ae) | ||
|
||
### Dataset Summary | ||
|
||
The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets, each annotated as positive or negative. The tweets are written in Modern Standard Arabic (MSA) or Jordanian dialect. | ||
|
||
### Supported Tasks and Leaderboards | ||
|
||
The dataset supports binary sentiment classification. It was introduced in this [paper](https://link.springer.com/chapter/10.1007/978-3-319-60042-0_66). | ||
|
||
### Languages | ||
|
||
The dataset is in Arabic: Modern Standard Arabic (MSA) or Jordanian dialect. | ||
|
||
## Dataset Structure | ||
|
||
### Data Instances | ||
|
||
A binary dataset in which each tweet carries a negative or positive sentiment label. | ||
|
||
### Data Fields | ||
|
||
- `text`: the tweet text, stored as a string. | ||
- `label`: the sentiment class, either `Negative` or `Positive`. | ||
|
||
### Data Splits | ||
|
||
The dataset is not split. | ||
|
||
| | Train | | ||
|---------- | ------ | | ||
|no split | 1,800 | | ||
|
||
## Dataset Creation | ||
|
||
### Curation Rationale | ||
|
||
[More Information Needed] | ||
|
||
### Source Data | ||
|
||
[More Information Needed] | ||
|
||
#### Initial Data Collection and Normalization | ||
|
||
The corpus contains 1,800 tweets collected from Twitter. | ||
|
||
#### Who are the source language producers? | ||
|
||
Twitter users. | ||
|
||
### Annotations | ||
|
||
The dataset does not contain any additional annotations. | ||
|
||
#### Annotation process | ||
|
||
[More Information Needed] | ||
|
||
#### Who are the annotators? | ||
|
||
[More Information Needed] | ||
|
||
### Personal and Sensitive Information | ||
|
||
[More Information Needed] | ||
|
||
## Considerations for Using the Data | ||
|
||
### Discussion of Social Impact and Biases | ||
|
||
[More Information Needed] | ||
|
||
### Other Known Limitations | ||
|
||
[More Information Needed] | ||
|
||
## Additional Information | ||
|
||
### Dataset Curators | ||
|
||
[More Information Needed] | ||
|
||
### Licensing Information | ||
|
||
[More Information Needed] | ||
|
||
### Citation Information | ||
|
||
[More Information Needed] |
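
The dataset card above states that the 1,800 tweets ship as a single unsplit set, so anyone who needs held-out data has to carve it out themselves. A minimal sketch of a reproducible shuffle-and-cut over row indices follows; the 80/20 ratio, the seed, and the helper name `split_indices` are illustrative assumptions, not part of this commit.

```python
import random


def split_indices(num_examples, test_fraction=0.2, seed=42):
    """Return (train_ids, test_ids) as two disjoint lists of row indices.

    Hypothetical helper: the ratio and seed are assumptions for illustration.
    """
    indices = list(range(num_examples))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    num_test = int(num_examples * test_fraction)
    return indices[num_test:], indices[:num_test]


train_ids, test_ids = split_indices(1800)
print(len(train_ids), len(test_ids))
```

This sketch ignores the labels; if class balance matters, the same idea can be applied separately to the indices of each class.
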
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
# coding=utf-8 | ||
# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# Lint as: python3 | ||
"""Arabic Jordanian General Tweets.""" | ||
|
||
from __future__ import absolute_import, division, print_function | ||
|
||
import os | ||
|
||
import pandas as pd | ||
|
||
import datasets | ||
|
||
|
||
_DESCRIPTION = """\ | ||
Arabic Jordanian General Tweets (AJGT) Corpus consisted of 1,800 tweets \ | ||
annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect. | ||
""" | ||
|
||
_CITATION = """\ | ||
@inproceedings{alomari2017arabic, | ||
title={Arabic tweets sentimental analysis using machine learning}, | ||
author={Alomari, Khaled Mohammad and ElSherif, Hatem M and Shaalan, Khaled}, | ||
booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems}, | ||
pages={602--610}, | ||
year={2017}, | ||
organization={Springer} | ||
} | ||
""" | ||
|
||
_URL = "https://raw.githubusercontent.com/komari6/Arabic-twitter-corpus-AJGT/master/" | ||
|
||
|
||
class AjgtConfig(datasets.BuilderConfig): | ||
"""BuilderConfig for Ajgt.""" | ||
|
||
def __init__(self, **kwargs): | ||
"""BuilderConfig for Ajgt. | ||
Args: | ||
**kwargs: keyword arguments forwarded to super. | ||
""" | ||
super(AjgtConfig, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs) | ||
|
||
|
||
class AjgtTwitterAr(datasets.GeneratorBasedBuilder): | ||
"""Ajgt dataset.""" | ||
|
||
BUILDER_CONFIGS = [ | ||
AjgtConfig( | ||
name="plain_text", | ||
description="Plain text", | ||
) | ||
] | ||
|
||
def _info(self): | ||
return datasets.DatasetInfo( | ||
description=_DESCRIPTION, | ||
features=datasets.Features( | ||
{ | ||
"text": datasets.Value("string"), | ||
"label": datasets.features.ClassLabel( | ||
names=[ | ||
"Negative", | ||
"Positive", | ||
] | ||
), | ||
} | ||
), | ||
supervised_keys=None, | ||
homepage="https://github.com/komari6/Arabic-twitter-corpus-AJGT", | ||
citation=_CITATION, | ||
) | ||
|
||
def _split_generators(self, dl_manager): | ||
urls_to_download = { | ||
"train": _URL + "AJGT.xlsx", | ||
} | ||
downloaded_files = dl_manager.download(urls_to_download) | ||
return [ | ||
datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}), | ||
] | ||
|
||
def _generate_examples(self, filepath): | ||
"""Generate examples.""" | ||
# Read the spreadsheet; each row holds a tweet ("Feed") and its sentiment label ("Sentiment"). | ||
df = pd.read_excel(filepath) | ||
for id_, record in df.iterrows(): | ||
tweet, sentiment = record["Feed"], record["Sentiment"] | ||
yield str(id_), {"text": tweet, "label": sentiment} |
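
The loading script above turns each spreadsheet row into an `(id, example)` pair and leaves it to `ClassLabel` to map the label strings to integers. The following is a library-free sketch of that flow, assuming rows arrive as dictionaries with `Feed` and `Sentiment` keys as in the script. The sample rows are placeholders rather than real tweets, and `generate_examples` and `LABEL_NAMES` are hypothetical stand-ins for the builder method and the `ClassLabel` names.

```python
# Same order as the ClassLabel names declared in the script.
LABEL_NAMES = ["Negative", "Positive"]


def generate_examples(rows):
    """Yield (key, example) pairs, mimicking the builder's _generate_examples."""
    for id_, record in enumerate(rows):
        label = record["Sentiment"]
        if label not in LABEL_NAMES:
            raise ValueError(f"Unexpected label {label!r} in row {id_}")
        # The real script yields the label string and lets the library encode it;
        # here the encoding is done explicitly to make the mapping visible.
        yield str(id_), {"text": record["Feed"], "label": LABEL_NAMES.index(label)}


# Placeholder rows for illustration only; not taken from the corpus.
rows = [
    {"Feed": "<tweet 0>", "Sentiment": "Positive"},
    {"Feed": "<tweet 1>", "Sentiment": "Negative"},
]
examples = list(generate_examples(rows))
print(examples)
```

With the names in this order, `Negative` maps to 0 and `Positive` maps to 1.
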
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"plain_text": {"description": "Arabic Jordanian General Tweets (AJGT) Corpus consisted of 1,800 tweets annotated as positive and negative. Modern Standard Arabic (MSA) or Jordanian dialect.\n", "citation": "@inproceedings{alomari2017arabic,\n title={Arabic tweets sentimental analysis using machine learning},\n author={Alomari, Khaled Mohammad and ElSherif, Hatem M and Shaalan, Khaled},\n booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},\n pages={602--610},\n year={2017},\n organization={Springer}\n}\n", "homepage": "https://github.com/komari6/Arabic-twitter-corpus-AJGT", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"num_classes": 2, "names": ["Negative", "Positive"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "builder_name": "ajgt_twitter_ar", "config_name": "plain_text", "version": {"version_str": "1.0.0", "description": "", "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 175424, "num_examples": 1800, "dataset_name": "ajgt_twitter_ar"}}, "download_checksums": {"https://raw.githubusercontent.com/komari6/Arabic-twitter-corpus-AJGT/master/AJGT.xlsx": {"num_bytes": 107395, "checksum": "966c52213872b6b8a3ced5fb7c60aee2abf47ca673c7d2c2eeb064a60bc9ed51"}}, "download_size": 107395, "post_processing_size": null, "dataset_size": 175424, "size_in_bytes": 282819}} |
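
The size fields recorded in `dataset_infos.json` above are internally consistent: `size_in_bytes` equals `download_size` plus `dataset_size`, and the recorded checksum is 64 hexadecimal characters long, which is the length of a SHA-256 digest. The sketch below checks that arithmetic and shows how a local copy of the spreadsheet could be hashed for comparison. The helpers `sha256_of_file` and `matches_recorded_checksum` are hypothetical names, any path passed to them is an assumption, and nothing is downloaded.

```python
import hashlib

# Values copied from the dataset_infos.json entry above.
info = {
    "download_size": 107395,
    "dataset_size": 175424,
    "size_in_bytes": 282819,
    "checksum": "966c52213872b6b8a3ced5fb7c60aee2abf47ca673c7d2c2eeb064a60bc9ed51",
}

sizes_consistent = info["download_size"] + info["dataset_size"] == info["size_in_bytes"]


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path):
    """Compare a local file (hypothetical path) against the recorded checksum."""
    return sha256_of_file(path) == info["checksum"]


print(sizes_consistent)
```
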
Binary file not shown.