Add OPUS DOGC dataset (huggingface#987)
* Add OPUS DOGC dataset

* Add dummy data

* Add dataset_infos.json

* Add tags to README

* Use Translation feature

* Update dataset_infos.json

* Update datasets/opus_dogc/README.md

* Fix README

* Update README.md

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
2 people authored and ggdupont committed Dec 4, 2020
1 parent 4ad157b commit 66d7a1e
Showing 4 changed files with 257 additions and 0 deletions.
155 changes: 155 additions & 0 deletions datasets/opus_dogc/README.md
@@ -0,0 +1,155 @@
---
annotations_creators:
- no-annotation
language_creators:
- expert-generated
languages:
- ca
- es
licenses:
- cc0-1.0
multilinguality:
- translation
size_categories:
- n>1M
source_datasets:
- original
task_categories:
- conditional-text-generation
task_ids:
- machine-translation
---

# Dataset Card for OPUS DOGC

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** http://opus.nlpl.eu/DOGC.php
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

OPUS DOGC is a collection of documents from the Official Journal of the Government of Catalonia, in Catalan and Spanish languages, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is multilingual, with parallel text in:
- Catalan
- Spanish

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

A data instance contains a single `translation` field, a dictionary with the following keys:
- `ca`: the Catalan text
- `es`: the aligned Spanish text
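As a hypothetical sketch, an instance using the `translation` feature defined in the loader script would have this shape (the sentences below are invented for illustration, not drawn from the corpus):

```python
# Illustrative shape of one example; the sentence text is invented, not from the corpus.
example = {
    "translation": {
        "ca": "El Govern aprova el decret.",     # Catalan side of the pair
        "es": "El Gobierno aprueba el decreto.",  # aligned Spanish side
    }
}

# The two languages are always keyed by their ISO 639-1 codes.
print(sorted(example["translation"]))  # ['ca', 'es']
```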

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is in the public domain under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/).

### Citation Information

```
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}
```
1 change: 1 addition & 0 deletions datasets/opus_dogc/dataset_infos.json
@@ -0,0 +1 @@
{"tmx": {"description": "This is a collection of documents from the Official Journal of the Government of Catalonia, in Catalan and Spanish languages, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.\n", "citation": "@inproceedings{tiedemann-2012-parallel,\n title = \"Parallel Data, Tools and Interfaces in {OPUS}\",\n author = {Tiedemann, J{\"o}rg},\n booktitle = \"Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)\",\n month = may,\n year = \"2012\",\n address = \"Istanbul, Turkey\",\n publisher = \"European Language Resources Association (ELRA)\",\n url = \"http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf\",\n pages = \"2214--2218\",\n abstract = \"This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.\",\n}\n", "homepage": "http://opus.nlpl.eu/DOGC.php", "license": "", "features": {"translation": {"languages": ["ca", "es"], "id": null, "_type": "Translation"}}, "post_processed": null, "supervised_keys": {"input": "ca", "output": "es"}, "builder_name": "opus_dogc", "config_name": "tmx", "version": "0.0.0", "splits": {"train": {"name": "train", "num_bytes": 1258924464, "num_examples": 4763575, "dataset_name": "opus_dogc"}}, "download_checksums": {"http://opus.nlpl.eu/download.php?f=DOGC/v2/tmx/ca-es.tmx.gz": {"num_bytes": 331724078, "checksum": "838e3972516b3dda6d20ebe403e180b9fb8b7e36574a0f9dceaabf18ed3d322e"}}, "download_size": 331724078, "post_processing_size": null, "dataset_size": 1258924464, "size_in_bytes": 1590648542}}
Binary file added datasets/opus_dogc/dummy/tmx/0.0.0/dummy_data.zip
101 changes: 101 additions & 0 deletions datasets/opus_dogc/opus_dogc.py
@@ -0,0 +1,101 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""OPUS DOGC dataset."""

import xml.etree.ElementTree as ET

import datasets


_DESCRIPTION = """\
This is a collection of documents from the Official Journal of the Government of Catalonia, in Catalan and Spanish \
languages, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.
"""

_CITATION = """\
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}
"""

_URL = "http://opus.nlpl.eu/DOGC.php"
_FILE_FORMATS = ["tmx"]
_URLS = {"tmx": "http://opus.nlpl.eu/download.php?f=DOGC/v2/tmx/ca-es.tmx.gz"}


class OpusDogcConfig(datasets.BuilderConfig):
    """BuilderConfig for OpusDogc."""

    def __init__(self, file_format=None, **kwargs):
        """
        Args:
            file_format: file format of the source data file (e.g. `tmx`).
            **kwargs: keyword arguments forwarded to super.
        """
        super().__init__(
            name=file_format, description=f"OPUS DOGC dataset from source file format {file_format}.", **kwargs
        )
        self.file_format = file_format


class OpusDogc(datasets.GeneratorBasedBuilder):
    """OPUS DOGC dataset."""

    BUILDER_CONFIG_CLASS = OpusDogcConfig
    BUILDER_CONFIGS = [OpusDogcConfig(file_format=file_format) for file_format in _FILE_FORMATS]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features({"translation": datasets.features.Translation(languages=("ca", "es"))}),
            supervised_keys=("ca", "es"),
            homepage=_URL,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        url_to_download = _URLS[self.config.file_format]
        downloaded_file = dl_manager.download_and_extract(url_to_download)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_file}),
        ]

    def _generate_examples(self, filepath):
        xml_lang = "{http://www.w3.org/XML/1998/namespace}lang"
        with open(filepath, encoding="utf-8") as f:
            id_ = 0
            # Stream-parse the TMX file: each <tuv> carries one language's segment,
            # and the enclosing <tu> element closes once both sides have been read.
            for _, elem in ET.iterparse(f):
                if elem.tag == "tuv":
                    language = elem.attrib[xml_lang]
                    sentence = elem.find("seg").text
                    if language == "ca":
                        ca_sentence = sentence
                    elif language == "es":
                        es_sentence = sentence
                elif elem.tag == "tu":
                    yield id_, {
                        "translation": {"ca": ca_sentence, "es": es_sentence},
                    }
                    id_ += 1
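The parsing approach used by `_generate_examples` above can be sketched on a tiny in-memory TMX fragment (the sentences and the `parse_tmx` helper are invented for illustration, not part of the dataset script):

```python
# Minimal sketch of streaming TMX parsing with ElementTree's iterparse,
# mirroring the structure of _generate_examples. Sample text is invented.
import io
import xml.etree.ElementTree as ET

TMX_SAMPLE = """<?xml version="1.0"?>
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="ca"><seg>Hola món</seg></tuv>
      <tuv xml:lang="es"><seg>Hola mundo</seg></tuv>
    </tu>
  </body>
</tmx>
"""

def parse_tmx(fileobj):
    """Yield one {lang: segment} dict per <tu> translation unit."""
    # xml:lang attributes are namespaced under the XML namespace by ElementTree.
    xml_lang = "{http://www.w3.org/XML/1998/namespace}lang"
    pair = {}
    for _, elem in ET.iterparse(fileobj):  # default: 'end' events only
        if elem.tag == "tuv":
            pair[elem.attrib[xml_lang]] = elem.find("seg").text
        elif elem.tag == "tu":
            # All child <tuv> end events have fired by now; emit the pair.
            yield dict(pair)
            pair.clear()

pairs = list(parse_tmx(io.StringIO(TMX_SAMPLE)))
print(pairs)  # [{'ca': 'Hola món', 'es': 'Hola mundo'}]
```

Because `iterparse` emits `end` events, each `<tuv>` is handled before its parent `<tu>` closes, which is what lets the loader accumulate both sides of a pair before yielding.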
