Add Siswati Ner corpus (huggingface#1111)

Co-authored-by: wambui <ywambui@heptanalytics.com>
katnoria · Dec 4, 2020 · ff5964a
1 parent aea9dcd
commit ff5964a

Showing 4 changed files with 316 additions and 0 deletions.
diff --git a/datasets/siswati_ner_corpus/README.md b/datasets/siswati_ner_corpus/README.md
@@ -0,0 +1,171 @@
+---
+annotations_creators:
+- expert-generated
+language_creators:
+- found
+languages:
+- ss
+licenses:
+- other-Creative Commons Attribution 2.5 South Africa License
+multilinguality:
+- monolingual
+size_categories:
+- 10K<n<100K
+source_datasets:
+- original
+task_categories:
+- structure-prediction
+task_ids:
+- named-entity-recognition
+---
+
+# Dataset Card for Siswati NER Corpus
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+
+## Dataset Description
+
+- **Homepage:** [Siswati Ner Corpus Homepage](https://repo.sadilar.org/handle/20.500.12185/346)
+- **Repository:**
+- **Paper:**
+- **Leaderboard:**
+- **Point of Contact:**  [Martin Puttkammer](mailto:Martin.Puttkammer@nwu.ac.za)
+
+### Dataset Summary
+
+The Siswati NER Corpus is a Siswati dataset developed by [The Centre for Text Technology (CTexT), North-West University, South Africa](http://humanities.nwu.ac.za/ctext). The data is based on documents from the South African government domain, crawled from gov.za websites. It was created to support the NER task for the Siswati language. The dataset uses CoNLL shared task annotation standards.
+
+### Supported Tasks and Leaderboards
+
+[More Information Needed]
+
+### Languages
+
+The language supported is Siswati.
+
+## Dataset Structure
+
+### Data Instances
+
+A data point is one sentence. In the source file, sentences are separated by an empty line, and each line holds a tab-separated token and tag. A loaded example looks like this:
+```
+{'id': '0',
+ 'ner_tags': [0, 0, 0, 0, 0],
+ 'tokens': ['Tinsita', 'tebantfu', ':', 'tinsita', 'tetakhamiti']
+}
+```
+
+### Data Fields
+
+- `id`: id of the sample
+- `tokens`: the tokens of the example text
+- `ner_tags`: the NER tags of each token
+
+The NER tags correspond to this list:
+```
+"OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"
+```
+The NER tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PERS), organizations (ORG), locations (LOC) and miscellaneous names (MISC). OUT is used for tokens not considered part of any named entity.
+
+### Data Splits
+
+The data was not split; the loader exposes the full corpus as a single `train` split.
+
+## Dataset Creation
+
+### Curation Rationale
+
+The data was created to help introduce resources for a new language, Siswati.
+
+[More Information Needed]
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+The data is based on documents from the South African government domain and was crawled from gov.za websites.
+
+#### Who are the source language producers?
+
+The data was produced by the writers of South African government websites (gov.za).
+
+[More Information Needed]
+
+### Annotations
+
+#### Annotation process
+
+[More Information Needed]
+
+#### Who are the annotators?
+
+The data was annotated during the NCHLT text resource development project.
+
+[More Information Needed]
+
+### Personal and Sensitive Information
+
+[More Information Needed]
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+[More Information Needed]
+
+### Discussion of Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+The annotated data sets were developed by the Centre for Text Technology (CTexT, North-West University, South Africa).
+
+See: [more information](http://www.nwu.ac.za/ctext)
+
+### Licensing Information
+
+The data is released under the [Creative Commons Attribution 2.5 South Africa License](http://creativecommons.org/licenses/by/2.5/za/legalcode).
+
+### Citation Information
+
+```
+@inproceedings{siswati_ner_corpus,
+  author    = {B.B. Malangwane and
+               M.N. Kekana and
+               S.S. Sedibe and
+               B.C. Ndhlovu and
+              Roald Eiselen},
+  title     = {NCHLT Siswati Named Entity Annotated Corpus},
+  booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.},
+  year      = {2016},
+  url       = {https://repo.sadilar.org/handle/20.500.12185/346},
+}
+```
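
Review note on the README above: the Data Fields section lists the label names, but a loaded example carries integer `ner_tags`. The following is a minimal sketch of reading them back, assuming the label order given in the card. `NER_LABELS` and `decode_entities` are hypothetical names used for illustration and are not part of this commit.

```python
# Label order copied from the dataset card; index i maps to NER_LABELS[i].
NER_LABELS = ["OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_entities(tokens, ner_tags):
    """Group B-/I- tagged tokens into (entity_type, text) pairs.

    Hypothetical helper, for illustration only.
    """
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, ner_tags):
        label = NER_LABELS[tag_id]
        # "B-PERS" -> ("B", "-", "PERS"); "OUT" -> ("OUT", "", "")
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A B- tag (or a stray I- tag) opens a new entity.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities
```

For the sample instance shown in the card, where every tag is `0` (`OUT`), this returns an empty list.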
diff --git a/datasets/siswati_ner_corpus/dataset_infos.json b/datasets/siswati_ner_corpus/dataset_infos.json
@@ -0,0 +1 @@
+{"siswati_ner_corpus": {"description": "Named entity annotated data from the NCHLT Text Resource Development: Phase II Project, annotated with PERSON, LOCATION, ORGANISATION and MISCELLANEOUS tags.\n", "citation": "@inproceedings{siswati_ner_corpus,\n  author    = {B.B. Malangwane and\n               M.N. Kekana and\n               S.S. Sedibe and\n               B.C. Ndhlovu and\n              Roald Eiselen},\n  title     = {NCHLT Siswati Named Entity Annotated Corpus},\n  booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th      Language Resource and Evaluation Conference, Portoro\u017e, Slovenia.},\n  year      = {2016},\n  url       = {https://repo.sadilar.org/handle/20.500.12185/346},\n}\n", "homepage": "https://repo.sadilar.org/handle/20.500.12185/346", "license": "Creative Commons Attribution 2.5 South Africa License", "features": {"id": {"dtype": "string", "id": null, "_type": "Value"}, "tokens": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "ner_tags": {"feature": {"num_classes": 9, "names": ["OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"], "names_file": null, "id": null, "_type": "ClassLabel"}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "siswati_ner_corpus", "config_name": "siswati_ner_corpus", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 3517151, "num_examples": 10798, "dataset_name": "siswati_ner_corpus"}}, "download_checksums": {"https://repo.sadilar.org/bitstream/handle/20.500.12185/346/nchlt_siswati_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y": {"num_bytes": 21882224, "checksum": "1939ae4161ecf8974cf35dfa57ef9d6c1b452f85942fecd7f2c1201b30a12b8d"}}, "download_size": 21882224, "post_processing_size": null, "dataset_size": 3517151, "size_in_bytes": 25399375}}
diff --git a/datasets/siswati_ner_corpus/dummy/siswati_ner_corpus/1.0.0/dummy_data.zip b/datasets/siswati_ner_corpus/dummy/siswati_ner_corpus/1.0.0/dummy_data.zip
diff --git a/datasets/siswati_ner_corpus/siswati_ner_corpus.py b/datasets/siswati_ner_corpus/siswati_ner_corpus.py
@@ -0,0 +1,144 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Named entity annotated data from the NCHLT Text Resource Development: Phase II Project for Siswati"""
+
+from __future__ import absolute_import, division, print_function
+
+import logging
+import os
+
+import datasets
+
+
+_CITATION = """\
+@inproceedings{siswati_ner_corpus,
+  author    = {B.B. Malangwane and
+               M.N. Kekana and
+               S.S. Sedibe and
+               B.C. Ndhlovu and
+              Roald Eiselen},
+  title     = {NCHLT Siswati Named Entity Annotated Corpus},
+  booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th      Language Resource and Evaluation Conference, Portorož, Slovenia.},
+  year      = {2016},
+  url       = {https://repo.sadilar.org/handle/20.500.12185/346},
+}
+"""
+
+
+_DESCRIPTION = """\
+Named entity annotated data from the NCHLT Text Resource Development: Phase II Project, annotated with PERSON, LOCATION, ORGANISATION and MISCELLANEOUS tags.
+"""
+
+
+_HOMEPAGE = "https://repo.sadilar.org/handle/20.500.12185/346"
+
+
+_LICENSE = "Creative Commons Attribution 2.5 South Africa License"
+
+
+# The HuggingFace datasets library doesn't host the datasets; it only points to the original files
+# This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method)
+_URL = "https://repo.sadilar.org/bitstream/handle/20.500.12185/346/nchlt_siswati_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y"
+
+_EXTRACTED_FILE = "NCHLT Siswati Named Entity Annotated Corpus/Dataset.NCHLT-II.ss.NER.Full.txt"
+
+
+class SiswatiNerCorpusConfig(datasets.BuilderConfig):
+    """BuilderConfig for SiswatiNerCorpus"""
+
+    def __init__(self, **kwargs):
+        """BuilderConfig for SiswatiNerCorpus.
+        Args:
+          **kwargs: keyword arguments forwarded to super.
+        """
+        super(SiswatiNerCorpusConfig, self).__init__(**kwargs)
+
+
+class SiswatiNerCorpus(datasets.GeneratorBasedBuilder):
+    """ SiswatiNerCorpus Ner dataset"""
+
+    BUILDER_CONFIGS = [
+        SiswatiNerCorpusConfig(
+            name="siswati_ner_corpus",
+            version=datasets.Version("1.0.0"),
+            description="SiswatiNerCorpus dataset",
+        ),
+    ]
+
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "id": datasets.Value("string"),
+                    "tokens": datasets.Sequence(datasets.Value("string")),
+                    "ner_tags": datasets.Sequence(
+                        datasets.features.ClassLabel(
+                            names=[
+                                "OUT",
+                                "B-PERS",
+                                "I-PERS",
+                                "B-ORG",
+                                "I-ORG",
+                                "B-LOC",
+                                "I-LOC",
+                                "B-MISC",
+                                "I-MISC",
+                            ]
+                        )
+                    ),
+                }
+            ),
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        data_dir = dl_manager.download_and_extract(_URL)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={"filepath": os.path.join(data_dir, _EXTRACTED_FILE)},
+            ),
+        ]
+
+    def _generate_examples(self, filepath):
+        logging.info("⏳ Generating examples from = %s", filepath)
+        with open(filepath, encoding="utf-8") as f:
+            guid = 0
+            tokens, ner_tags = [], []
+            for line in f:
+                if line == "" or line == "\n":
+                    if tokens:
+                        yield guid, {
+                            "id": str(guid),
+                            "tokens": tokens,
+                            "ner_tags": ner_tags,
+                        }
+                        guid += 1
+                        tokens, ner_tags = [], []
+                else:
+                    splits = line.split("\t")
+                    tokens.append(splits[0])
+                    ner_tags.append(splits[1].rstrip())
+            # Flush the last sentence unless the file ended with a blank line.
+            if tokens:
+                yield guid, {
+                    "id": str(guid),
+                    "tokens": tokens,
+                    "ner_tags": ner_tags,
+                }
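
Review note on `_generate_examples`: the sentence-grouping logic can be exercised without downloading the corpus. The following is a minimal sketch, assuming the same layout of one tab-separated token and tag per line with blank lines between sentences. `iter_sentences` is a hypothetical stand-alone helper, and the sample lines are made-up placeholders rather than corpus data.

```python
import io


def iter_sentences(lines):
    """Yield (tokens, tags) for each sentence in an iterable of token<TAB>tag lines."""
    tokens, tags = [], []
    for line in lines:
        if line.strip() == "":
            # A blank line closes the current sentence, if any tokens were collected.
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
        else:
            # Raises ValueError on a line without a tab, which surfaces malformed input early.
            token, tag = line.rstrip("\r\n").split("\t")[:2]
            tokens.append(token)
            tags.append(tag)
    # Emit the last sentence only if it has tokens, so a trailing blank line
    # does not produce an empty example.
    if tokens:
        yield tokens, tags


# Placeholder input: two sentences, with a trailing blank line at the end.
sample = io.StringIO("w1\tB-PERS\nw2\tI-PERS\n\nw3\tOUT\n\n")
sentences = list(iter_sentences(sample))
```

With this input, `sentences` holds two entries. Treating whitespace-only lines as separators is a choice made in this sketch; the loader itself compares against `""` and `"\n"` only.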