nkjp-ner (huggingface#1079)
* adding nkjp-ner dataset

Co-authored-by: Michal Jamry <michal.jamry@yougov.com>
2 people authored and sileod committed Dec 7, 2020
1 parent 2e472a9 commit c8e64c1
Showing 4 changed files with 260 additions and 0 deletions.
153 changes: 153 additions & 0 deletions datasets/nkjp-ner/README.md
@@ -0,0 +1,153 @@
---
annotations_creators:
- expert-generated
language_creators:
- other
languages:
- pl
licenses:
- gpl-3.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- structure-prediction
task_ids:
- named-entity-recognition
---

# Dataset Card for NKJP-NER

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** http://nkjp.pl/index.php?page=0&lang=1
- **Repository:**
- **Paper:**
@book{przepiorkowski2012narodowy,
title={Narodowy korpus j{\k{e}}zyka polskiego},
author={Przepi{\'o}rkowski, Adam},
year={2012},
publisher={Naukowe PWN}
}
- **Leaderboard:**
- **Point of Contact:** adamp@ipipan.waw.pl

### Dataset Summary

A linguistic corpus is a collection of texts that shows how a word or phrase is typically used, along with its meaning and grammatical function. Without access to a language corpus it is now practically impossible to do linguistic research, write dictionaries, grammars, and language teaching books, or build search engines sensitive to Polish inflection, machine translation engines, and other advanced language technology. Language corpora have become an essential tool for linguists, and they are also helpful to software engineers, scholars of literature and culture, historians, librarians, and other specialists in the arts and computer science.

NKJP-NER is based on the manually annotated 1-million-word subcorpus of the NKJP, which is available under the GNU GPL v.3.

### Supported Tasks and Leaderboards

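Named entity recognition, framed here as sentence-level classification: each sentence contains named entities of exactly one type, and the task is to predict that type.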

[More Information Needed]

### Languages

Polish

## Dataset Structure

### Data Instances

The data comes as three TSV files. The train and dev files (`train.tsv`, `dev.tsv`) have two columns (sentence, target); the test file (`test_features.tsv`) has a single column (sentence).
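
As a rough illustration of this layout, the sketch below parses two small in-memory TSV snippets with the standard `csv` module. The sample sentences and the `read_rows` helper are made up for this example and are not taken from the corpus; the column names and the `QUOTE_NONE` setting follow the loader script.

```python
import csv
import io

# Hypothetical rows, invented for illustration; not actual corpus content.
TRAIN_TSV = (
    "sentence\ttarget\n"
    '"Solidarność" powstała w 1980 roku.\torgName\n'
    "To jest zwykłe zdanie.\tnoEntity\n"
)
TEST_TSV = "sentence\nKolejne zdanie testowe.\n"


def read_rows(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    # QUOTE_NONE keeps quotation marks in a sentence as literal characters
    # instead of treating them as CSV field delimiters.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


train_rows = read_rows(TRAIN_TSV)
test_rows = read_rows(TEST_TSV)

print(train_rows[0])         # a dict with both "sentence" and "target"
print(sorted(test_rows[0]))  # the test file only provides "sentence"
```

Disabling quote handling matters because sentences may begin with a quotation mark, which the default CSV dialect would otherwise try to interpret.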

### Data Fields

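- sentence: the text of the sentence (string)
- target: the type of the named entity in the sentence, one of geogName, noEntity, orgName, persName, placeName, time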
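
The target is stored as a class label, so each name corresponds to an integer id given by its position in the list of names. A minimal sketch of that mapping in plain Python follows; `encode`, `decode`, and the `"<unlabeled>"` placeholder are hypothetical helpers for illustration, and the `-1` case reflects the value the loader script assigns to test examples, which ship without labels.

```python
# Class names in the order declared in the dataset's feature definition.
LABEL_NAMES = ["geogName", "noEntity", "orgName", "persName", "placeName", "time"]

# Hypothetical lookup tables built from the list order.
NAME_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}
ID_TO_NAME = dict(enumerate(LABEL_NAMES))


def encode(label):
    """Map a label name to its integer id."""
    return NAME_TO_ID[label]


def decode(idx):
    """Map an integer id back to its name; ids outside the range are unlabeled."""
    return ID_TO_NAME.get(idx, "<unlabeled>")


print(encode("persName"))
print(decode(0))
print(decode(-1))  # the test split carries -1 instead of a real label
```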

### Data Splits

The data is split into train, dev (validation), and test sets.
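
For a sense of scale, the short sketch below computes each split's share of the data from the example counts recorded in `dataset_infos.json`. The variable names are illustrative only.

```python
# Example counts per split, as recorded in dataset_infos.json.
SPLIT_SIZES = {"train": 15794, "validation": 1941, "test": 2058}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, share in shares.items():
    print(f"{name}: {SPLIT_SIZES[name]} examples ({share:.1%})")
```

The split is roughly 80/10/10, with the test set slightly larger than the validation set.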

## Dataset Creation

### Curation Rationale

This dataset is one of nine evaluation tasks in the KLEJ benchmark, which is intended to improve Polish language processing.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

GNU GPL v.3

### Citation Information

@book{przepiorkowski2012narodowy,
title={Narodowy korpus j{\k{e}}zyka polskiego},
author={Przepi{\'o}rkowski, Adam},
year={2012},
publisher={Naukowe PWN}
}
1 change: 1 addition & 0 deletions datasets/nkjp-ner/dataset_infos.json
@@ -0,0 +1 @@
{"default": {"description": "The NKJP-NER is based on a human-annotated part of National Corpus of Polish (NKJP). We extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity.\n", "citation": "@book{przepiorkowski2012narodowy,\ntitle={Narodowy korpus j{\\k{e}}zyka polskiego},\nauthor={Przepi{'o}rkowski, Adam},\nyear={2012},\npublisher={Naukowe PWN}\n}\n", "homepage": "https://klejbenchmark.com/tasks/", "license": "GNU GPL v.3", "features": {"sentence": {"dtype": "string", "id": null, "_type": "Value"}, "target": {"num_classes": 6, "names": ["geogName", "noEntity", "orgName", "persName", "placeName", "time"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "builder_name": "nkjp_ner", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1612125, "num_examples": 15794, "dataset_name": "nkjp_ner"}, "test": {"name": "test", "num_bytes": 221092, "num_examples": 2058, "dataset_name": "nkjp_ner"}, "validation": {"name": "validation", "num_bytes": 196652, "num_examples": 1941, "dataset_name": "nkjp_ner"}}, "download_checksums": {"https://klejbenchmark.com/static/data/klej_nkjp-ner.zip": {"num_bytes": 821629, "checksum": "4b4573219731b198d43958e347dcd3e83654c89daa980c88de3bec8d628044ac"}}, "download_size": 821629, "post_processing_size": null, "dataset_size": 2029869, "size_in_bytes": 2851498}}
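
A quick sanity check on the size fields in this file can be sketched as follows. It assumes that `dataset_size` is meant to equal the sum of the per-split `num_bytes` values and that `size_in_bytes` is the download size plus the dataset size; the numbers above are consistent with that reading, but the interpretation is an assumption here. Only the relevant fields are copied into the snippet.

```python
import json

# Size-related fields copied from dataset_infos.json; other keys are omitted.
INFO = json.loads("""
{
  "splits": {
    "train": {"num_bytes": 1612125, "num_examples": 15794},
    "test": {"num_bytes": 221092, "num_examples": 2058},
    "validation": {"num_bytes": 196652, "num_examples": 1941}
  },
  "download_size": 821629,
  "dataset_size": 2029869,
  "size_in_bytes": 2851498
}
""")

# Sum the per-split byte counts and compare against the declared totals.
split_bytes = sum(split["num_bytes"] for split in INFO["splits"].values())
assert split_bytes == INFO["dataset_size"]
assert INFO["download_size"] + INFO["dataset_size"] == INFO["size_in_bytes"]
print("size fields are internally consistent")
```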
Binary file added datasets/nkjp-ner/dummy/1.1.0/dummy_data.zip
Binary file not shown.
106 changes: 106 additions & 0 deletions datasets/nkjp-ner/nkjp-ner.py
@@ -0,0 +1,106 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""NKJP-NER"""

from __future__ import absolute_import, division, print_function

import csv
import os

import datasets


# Backslashes are doubled so the BibTeX accent commands survive Python string escaping.
_CITATION = """\
@book{przepiorkowski2012narodowy,
title={Narodowy korpus j{\\k{e}}zyka polskiego},
author={Przepi{\\'o}rkowski, Adam},
year={2012},
publisher={Naukowe PWN}
}
"""

_DESCRIPTION = """\
The NKJP-NER is based on a human-annotated part of National Corpus of Polish (NKJP). We extracted sentences with named entities of exactly one type. The task is to predict the type of the named entity.
"""

_HOMEPAGE = "https://klejbenchmark.com/tasks/"

_LICENSE = "GNU GPL v.3"

_URLs = "https://klejbenchmark.com/static/data/klej_nkjp-ner.zip"


class NkjpNer(datasets.GeneratorBasedBuilder):
    """NKJP-NER"""

    VERSION = datasets.Version("1.1.0")

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "sentence": datasets.Value("string"),
                    "target": datasets.ClassLabel(
                        names=[
                            "geogName",
                            "noEntity",
                            "orgName",
                            "persName",
                            "placeName",
                            "time",
                        ]
                    ),
                }
            ),
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        data_dir = dl_manager.download_and_extract(_URLs)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": os.path.join(data_dir, "train.tsv"),
                    "split": "train",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepath": os.path.join(data_dir, "test_features.tsv"), "split": "test"},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "filepath": os.path.join(data_dir, "dev.tsv"),
                    "split": "dev",
                },
            ),
        ]

    def _generate_examples(self, filepath, split):
        """ Yields examples. """
        with open(filepath, encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for id_, row in enumerate(reader):
                yield id_, {
                    "sentence": row["sentence"],
                    "target": -1 if split == "test" else row["target"],
                }
