Add OPUS DOGC dataset (huggingface#987)
* Add OPUS DOGC dataset

* Add dummy data

* Add dataset_infos.json

* Add tags to README

* Use Translation feature

* Update dataset_infos.json

* Update datasets/opus_dogc/README.md

* Fix README

* Update README.md

Co-authored-by: Yacine Jernite <yjernite@users.noreply.github.com>
2 people authored and ggdupont committed Dec 4, 2020
1 parent 4ad157b commit 66d7a1e
Showing 4 changed files with 257 additions and 0 deletions.
155 changes: 155 additions & 0 deletions datasets/opus_dogc/README.md
@@ -0,0 +1,155 @@
---
annotations_creators:
- no-annotation
language_creators:
- expert-generated
languages:
- ca
- es
licenses:
- cc0-1.0
multilinguality:
- translation
size_categories:
- n>1M
source_datasets:
- original
task_categories:
- conditional-text-generation
task_ids:
- machine-translation
---

# Dataset Card for OPUS DOGC

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** http://opus.nlpl.eu/DOGC.php
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

OPUS DOGC is a collection of documents from the Official Journal of the Government of Catalonia, in Catalan and Spanish languages, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is multilingual, with parallel text in:
- Catalan
- Spanish

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

A data instance contains a single `translation` field, a dictionary with the following keys:
- `ca`: the Catalan text
- `es`: the aligned Spanish text
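As a hypothetical sketch, an instance using the `translation` feature defined in the loader script would have this shape (the sentences below are invented for illustration, not drawn from the corpus):

```python
# Illustrative shape of one example; the sentence text is invented, not from the corpus.
example = {
    "translation": {
        "ca": "El Govern aprova el decret.",     # Catalan side of the pair
        "es": "El Gobierno aprueba el decreto.",  # aligned Spanish side
    }
}

# The two languages are always keyed by their ISO 639-1 codes.
print(sorted(example["translation"]))  # ['ca', 'es']
```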

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset is in the public domain under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/).

### Citation Information

```
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}
```
1 change: 1 addition & 0 deletions datasets/opus_dogc/dataset_infos.json
@@ -0,0 +1 @@
{"tmx": {"description": "This is a collection of documents from the Official Journal of the Government of Catalonia, in Catalan and Spanish languages, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.\n", "citation": "@inproceedings{tiedemann-2012-parallel,\n title = \"Parallel Data, Tools and Interfaces in {OPUS}\",\n author = {Tiedemann, J{\"o}rg},\n booktitle = \"Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)\",\n month = may,\n year = \"2012\",\n address = \"Istanbul, Turkey\",\n publisher = \"European Language Resources Association (ELRA)\",\n url = \"http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf\",\n pages = \"2214--2218\",\n abstract = \"This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.\",\n}\n", "homepage": "http://opus.nlpl.eu/DOGC.php", "license": "", "features": {"translation": {"languages": ["ca", "es"], "id": null, "_type": "Translation"}}, "post_processed": null, "supervised_keys": {"input": "ca", "output": "es"}, "builder_name": "opus_dogc", "config_name": "tmx", "version": "0.0.0", "splits": {"train": {"name": "train", "num_bytes": 1258924464, "num_examples": 4763575, "dataset_name": "opus_dogc"}}, "download_checksums": {"http://opus.nlpl.eu/download.php?f=DOGC/v2/tmx/ca-es.tmx.gz": {"num_bytes": 331724078, "checksum": "838e3972516b3dda6d20ebe403e180b9fb8b7e36574a0f9dceaabf18ed3d322e"}}, "download_size": 331724078, "post_processing_size": null, "dataset_size": 1258924464, "size_in_bytes": 1590648542}}
Binary file added datasets/opus_dogc/dummy/tmx/0.0.0/dummy_data.zip
101 changes: 101 additions & 0 deletions datasets/opus_dogc/opus_dogc.py
@@ -0,0 +1,101 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""OPUS DOGC dataset."""

import xml.etree.ElementTree as ET

import datasets


_DESCRIPTION = """\
This is a collection of documents from the Official Journal of the Government of Catalonia, in Catalan and Spanish \
languages, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.
"""

_CITATION = """\
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}
"""

_URL = "http://opus.nlpl.eu/DOGC.php"
_FILE_FORMATS = ["tmx"]
_URLS = {"tmx": "http://opus.nlpl.eu/download.php?f=DOGC/v2/tmx/ca-es.tmx.gz"}


class OpusDogcConfig(datasets.BuilderConfig):
    """BuilderConfig for OpusDogc."""

    def __init__(self, file_format=None, **kwargs):
        """
        Args:
            file_format: file format of the source data file (e.g. `tmx`).
            **kwargs: keyword arguments forwarded to super.
        """
        super().__init__(
            name=file_format, description=f"OPUS DOGC dataset from source file format {file_format}.", **kwargs
        )
        self.file_format = file_format


class OpusDogc(datasets.GeneratorBasedBuilder):
    """OPUS DOGC dataset."""

    BUILDER_CONFIG_CLASS = OpusDogcConfig
    BUILDER_CONFIGS = [OpusDogcConfig(file_format=file_format) for file_format in _FILE_FORMATS]

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features({"translation": datasets.features.Translation(languages=("ca", "es"))}),
            supervised_keys=("ca", "es"),
            homepage=_URL,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        url_to_download = _URLS[self.config.file_format]
        downloaded_file = dl_manager.download_and_extract(url_to_download)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_file}),
        ]

    def _generate_examples(self, filepath):
        xml_lang = "{http://www.w3.org/XML/1998/namespace}lang"
        with open(filepath, encoding="utf-8") as f:
            id_ = 0
            # Stream-parse the TMX file: each <tuv> carries one language's segment,
            # and the enclosing <tu> element closes once both sides have been read.
            for _, elem in ET.iterparse(f):
                if elem.tag == "tuv":
                    language = elem.attrib[xml_lang]
                    sentence = elem.find("seg").text
                    if language == "ca":
                        ca_sentence = sentence
                    elif language == "es":
                        es_sentence = sentence
                elif elem.tag == "tu":
                    yield id_, {
                        "translation": {"ca": ca_sentence, "es": es_sentence},
                    }
                    id_ += 1
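The parsing approach used by `_generate_examples` above can be sketched on a tiny in-memory TMX fragment (the sentences and the `parse_tmx` helper are invented for illustration, not part of the dataset script):

```python
# Minimal sketch of streaming TMX parsing with ElementTree's iterparse,
# mirroring the structure of _generate_examples. Sample text is invented.
import io
import xml.etree.ElementTree as ET

TMX_SAMPLE = """<?xml version="1.0"?>
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="ca"><seg>Hola món</seg></tuv>
      <tuv xml:lang="es"><seg>Hola mundo</seg></tuv>
    </tu>
  </body>
</tmx>
"""

def parse_tmx(fileobj):
    """Yield one {lang: segment} dict per <tu> translation unit."""
    # xml:lang attributes are namespaced under the XML namespace by ElementTree.
    xml_lang = "{http://www.w3.org/XML/1998/namespace}lang"
    pair = {}
    for _, elem in ET.iterparse(fileobj):  # default: 'end' events only
        if elem.tag == "tuv":
            pair[elem.attrib[xml_lang]] = elem.find("seg").text
        elif elem.tag == "tu":
            # All child <tuv> end events have fired by now; emit the pair.
            yield dict(pair)
            pair.clear()

pairs = list(parse_tmx(io.StringIO(TMX_SAMPLE)))
print(pairs)  # [{'ca': 'Hola món', 'es': 'Hola mundo'}]
```

Because `iterparse` emits `end` events, each `<tuv>` is handled before its parent `<tu>` closes, which is what lets the loader accumulate both sides of a pair before yielding.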
