nkjp-ner (huggingface#1079)

* adding nkjp-ner dataset * adding nkjp-ner dataset Co-authored-by: Michal Jamry <michal.jamry@yougov.com>
sileod · Dec 7, 2020 · c8e64c1
1 parent 2e472a9
Showing 4 changed files with 260 additions and 0 deletions.
diff --git a/datasets/nkjp-ner/README.md b/datasets/nkjp-ner/README.md
@@ -0,0 +1,153 @@
+---
+annotations_creators:
+- expert-generated
+language_creators:
+- other
+languages:
+- pl
+licenses:
+- gpl-3.0
+multilinguality:
+- monolingual
+size_categories:
+- 10K<n<100K
+source_datasets:
+- original
+task_categories:
+- structure-prediction
+task_ids:
+- named-entity-recognition
+---
+
+# Dataset Card for NKJP-NER
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+
+## Dataset Description
+
+- **Homepage:**
+http://nkjp.pl/index.php?page=0&lang=1
+- **Repository:**
+- **Paper:**
+@book{przepiorkowski2012narodowy,
+title={Narodowy korpus j{\k{e}}zyka polskiego},
+author={Przepi{\'o}rkowski, Adam},
+year={2012},
+publisher={Naukowe PWN}}
+- **Leaderboard:**
+- **Point of Contact:**
+adamp@ipipan.waw.pl
+
+### Dataset Summary
+
+A linguistic corpus is a collection of texts that documents the typical usage of words and phrases, along with their meanings and grammatical functions. Language corpora support linguistic research, the writing of dictionaries, grammars, and language textbooks, and the development of search engines sensitive to Polish inflection, machine translation engines, and other advanced language technology. They are an essential tool for linguists and are also helpful for software engineers, scholars of literature and culture, historians, librarians, and other specialists in the arts and computer science.
+NKJP-NER is based on the manually annotated 1-million-word subcorpus of the NKJP (National Corpus of Polish), which is available under GNU GPL v.3. It contains sentences with named entities of exactly one type, and the task is to predict that type.
+
+### Supported Tasks and Leaderboards
+
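+Named entity recognition, framed as sentence-level classification: each sentence contains named entities of exactly one type (or none), and the task is to predict that type.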
+
+[More Information Needed]
+
+### Languages
+
+Polish
+
+## Dataset Structure
+
+### Data Instances
+
+Two TSV files (`train.tsv`, `dev.tsv`) with two columns (`sentence`, `target`) and one (`test_features.tsv`) with a single column (`sentence`).
+
+### Data Fields
+
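+- `sentence`: the input sentence (string)
+- `target`: the named-entity type, one of `geogName`, `noEntity`, `orgName`, `persName`, `placeName`, `time` (absent from the test file)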
+
+### Data Splits
+
+The data is split into train (15,794 examples), validation (1,941 examples), and test (2,058 examples) sets.
+
+## Dataset Creation
+
+### Curation Rationale
+
+This dataset is one of the nine evaluation tasks of the KLEJ benchmark, which aims to improve Polish language processing.
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+[More Information Needed]
+
+#### Who are the source language producers?
+
+[More Information Needed]
+
+### Annotations
+
+#### Annotation process
+
+[More Information Needed]
+
+#### Who are the annotators?
+
+[More Information Needed]
+
+### Personal and Sensitive Information
+
+[More Information Needed]
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+[More Information Needed]
+
+### Discussion of Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+[More Information Needed]
+
+### Licensing Information
+
+GNU GPL v.3
+
+### Citation Information
+
+@book{przepiorkowski2012narodowy,
+title={Narodowy korpus j{\k{e}}zyka polskiego},
+author={Przepi{\'o}rkowski, Adam},
+year={2012},
+publisher={Naukowe PWN}
+}
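The file layout described under "Data Instances" can be sketched with the standard library alone. This is a minimal illustration, not the loader itself: the sample sentences and the helper name `read_rows` are hypothetical, and the only things taken from the commit are the column names (`sentence`, `target`) and the tab delimiter.

```python
import csv
import io

# Hypothetical two-row sample in the train/dev layout (not real corpus data).
TRAIN_TSV = (
    "sentence\ttarget\n"
    "Jan Kowalski mieszka tutaj .\tpersName\n"
    "Nic tu nie ma .\tnoEntity\n"
)
# Hypothetical sample in the test layout: the target column is absent.
TEST_TSV = "sentence\nWarszawa jest duża .\n"


def read_rows(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


train_rows = read_rows(TRAIN_TSV)
test_rows = read_rows(TEST_TSV)

print(train_rows[0]["sentence"], "->", train_rows[0]["target"])
print("test rows carry a target column:", "target" in test_rows[0])
```

Because the test file has no `target` column, any code that reads it has to supply a placeholder label or skip the field.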
diff --git a/datasets/nkjp-ner/dataset_infos.json b/datasets/nkjp-ner/dataset_infos.json
@@ -0,0 +1 @@
+{"default": {"description": "The NKJP-NER is based on a human-annotated part of National Corpus of Polish (NKJP). We extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity.\n", "citation": "@book{przepiorkowski2012narodowy,\ntitle={Narodowy korpus j{\\k{e}}zyka polskiego},\nauthor={Przepi{'o}rkowski, Adam},\nyear={2012},\npublisher={Naukowe PWN}\n}\n", "homepage": "https://klejbenchmark.com/tasks/", "license": "GNU GPL v.3", "features": {"sentence": {"dtype": "string", "id": null, "_type": "Value"}, "target": {"num_classes": 6, "names": ["geogName", "noEntity", "orgName", "persName", "placeName", "time"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "builder_name": "nkjp_ner", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1612125, "num_examples": 15794, "dataset_name": "nkjp_ner"}, "test": {"name": "test", "num_bytes": 221092, "num_examples": 2058, "dataset_name": "nkjp_ner"}, "validation": {"name": "validation", "num_bytes": 196652, "num_examples": 1941, "dataset_name": "nkjp_ner"}}, "download_checksums": {"https://klejbenchmark.com/static/data/klej_nkjp-ner.zip": {"num_bytes": 821629, "checksum": "4b4573219731b198d43958e347dcd3e83654c89daa980c88de3bec8d628044ac"}}, "download_size": 821629, "post_processing_size": null, "dataset_size": 2029869, "size_in_bytes": 2851498}}
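The size fields in `dataset_infos.json` can be cross-checked with a little arithmetic. The sketch below copies the numbers from the JSON above and assumes the usual relationship between them: `dataset_size` is the sum of the per-split `num_bytes`, and `size_in_bytes` is `download_size` plus `dataset_size`. The variable names are illustrative only.

```python
# Values copied from dataset_infos.json above.
split_bytes = {"train": 1612125, "test": 221092, "validation": 196652}
split_examples = {"train": 15794, "test": 2058, "validation": 1941}
download_size = 821629
dataset_size = 2029869
size_in_bytes = 2851498

# Assumed relationships between the size fields.
assert sum(split_bytes.values()) == dataset_size
assert download_size + dataset_size == size_in_bytes

total_examples = sum(split_examples.values())
print(total_examples)  # 19793
```

The total of 19,793 examples is consistent with the `10K<n<100K` size category declared in the README front matter.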
diff --git a/datasets/nkjp-ner/dummy/1.1.0/dummy_data.zip b/datasets/nkjp-ner/dummy/1.1.0/dummy_data.zip
diff --git a/datasets/nkjp-ner/nkjp-ner.py b/datasets/nkjp-ner/nkjp-ner.py
@@ -0,0 +1,106 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""NKJP-NER"""
+
+from __future__ import absolute_import, division, print_function
+
+import csv
+import os
+
+import datasets
+
+
+_CITATION = """\
+@book{przepiorkowski2012narodowy,
+title={Narodowy korpus j{\\k{e}}zyka polskiego},
+author={Przepi{\\'o}rkowski, Adam},
+year={2012},
+publisher={Naukowe PWN}
+}
+"""
+
+_DESCRIPTION = """\
+The NKJP-NER is based on a human-annotated part of National Corpus of Polish (NKJP). We extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity.
+"""
+
+_HOMEPAGE = "https://klejbenchmark.com/tasks/"
+
+_LICENSE = "GNU GPL v.3"
+
+_URLs = "https://klejbenchmark.com/static/data/klej_nkjp-ner.zip"
+
+
+class NkjpNer(datasets.GeneratorBasedBuilder):
+    """NKJP-NER"""
+
+    VERSION = datasets.Version("1.1.0")
+
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "sentence": datasets.Value("string"),
+                    "target": datasets.ClassLabel(
+                        names=[
+                            "geogName",
+                            "noEntity",
+                            "orgName",
+                            "persName",
+                            "placeName",
+                            "time",
+                        ]
+                    ),
+                }
+            ),
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+        data_dir = dl_manager.download_and_extract(_URLs)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "train.tsv"),
+                    "split": "train",
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={"filepath": os.path.join(data_dir, "test_features.tsv"), "split": "test"},
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "dev.tsv"),
+                    "split": "dev",
+                },
+            ),
+        ]
+
+    def _generate_examples(self, filepath, split):
+        """Yields (key, example) pairs read from a tab-separated file with a header row."""
+        with open(filepath, encoding="utf-8") as f:
+            reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
+            for id_, row in enumerate(reader):
+                yield id_, {
+                    "sentence": row["sentence"],
+                    "target": -1 if split == "test" else row["target"],  # test file has no target column
+                }
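The way `_generate_examples` interacts with the `ClassLabel` feature can be sketched in plain Python. This is an approximation under stated assumptions, not the library's implementation: it assumes a class label is stored as the index of its name in the `names` list and that `-1` is accepted as a placeholder for a missing label. `LABEL_NAMES` is copied from the script, while `encode_target` and `generate_examples` are hypothetical helpers written for this illustration.

```python
# Label names copied from the ClassLabel definition in nkjp-ner.py.
LABEL_NAMES = ["geogName", "noEntity", "orgName", "persName", "placeName", "time"]
NAME_TO_ID = {name: index for index, name in enumerate(LABEL_NAMES)}


def encode_target(raw_target):
    """Hypothetical helper: map a label name to its index; pass -1 through unchanged."""
    if raw_target == -1:
        return -1
    return NAME_TO_ID[raw_target]


def generate_examples(rows, split):
    """Mirror the script's logic: unlabeled test rows get the placeholder -1."""
    for id_, row in enumerate(rows):
        target = -1 if split == "test" else row["target"]
        yield id_, {"sentence": row["sentence"], "target": encode_target(target)}


# Hypothetical rows, shaped like the dicts csv.DictReader would produce.
train_examples = list(generate_examples([{"sentence": "A", "target": "persName"}], "train"))
test_examples = list(generate_examples([{"sentence": "B"}], "test"))

print(train_examples[0][1]["target"])
print(test_examples[0][1]["target"])
```

Downstream code should filter out or ignore examples whose `target` is `-1` before computing metrics, since that value marks a missing label and not a seventh class.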