GitHub - katjakaterina/dihutra: This repository is for the corpus data and results within the EAMT sponsorship 'Parallel Corpus to Analyse Differences between Human Translations', i.e. DiHUTra. Within this activity, a corpus of professional and student translations from English into Russian and Croatian is created.

Repository contents:
analyses
fortranslators
metadata
news
pos3grams/reviews
processing
reviews
README
en-fi.reviews.profs.id+en+fi.tsv
en-fi.reviews.studs.id+en+fi.tsv
en-hr.news+reviews.profs.id+en+hr.tsv
en-hr.news+reviews.studs.id+en+hr.tsv
en-ru.news+reviews.profs.id+en+ru.tsv
en-ru.news+reviews.studs.id+en+ru.tsv

DESCRIPTION
The DiHuTra corpus (Differences in Human Translations) was collected within activities supported by EAMT (EAMT sponsorship 2021). It is a corpus of human translations that contains both professional and student translations. The data consists of English source texts, namely news (from WMT2020 and 2019) and Amazon reviews, and their translations into Russian and Croatian, together with a subcorpus containing translations of the Amazon review texts into Finnish. All target languages are mid-resourced and less or moderately investigated. The same source texts were translated into the three target languages by two translator groups: students and professionals. We additionally include German professional translations of the news texts, which were available in the WMT dataset.

For the Amazon reviews subcorpus, each review in English was translated into the three target languages, Croatian, Russian and Finnish, by professionals and by students.

For the news subcorpus, Russian translations were already available from the WMT shared task and Croatian translations were produced for the purpose of this work. Finnish professional translations were not provided for the news articles.

In addition to the translations, information about the translators' age, gender, experience and (for students) study program was collected. Translators were asked to keep the sentence alignment, that is, not to merge or split sentences, so that each English sentence corresponds to exactly one translated sentence (which is important for current MT systems). They were also asked not to use machine translation in the process of translation. No further restrictions were given to translators.

The corpus is valuable for studying variation in translation, as it allows a direct comparison between human translations of the same source texts. The corpus is also a valuable resource for evaluating machine translation systems. We believe that this resource will facilitate the understanding and improvement of quality issues in both human and machine translation.
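As a rough illustration of such a direct comparison, the sketch below pairs professional and student translations of the same source sentence. It rests on assumptions that are not confirmed by this README: that each TSV file has exactly three tab-separated columns in the order suggested by its file name (id, English source, translation), that there is no header row, and that segment IDs are shared between the professional and student files. The helper names `read_parallel_tsv` and `pair_translations` are hypothetical, and the sample rows are invented for the demonstration.

```python
import csv
import io


def read_parallel_tsv(handle):
    """Return {segment_id: (source, translation)} from a three-column TSV.

    Assumption: columns are id, English source, translation (as the
    file names suggest), with no header row.
    """
    rows = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        seg_id, source, target = fields
        rows[seg_id] = (source, target)
    return rows


def pair_translations(profs, studs):
    """Pair professional and student translations that share a segment ID."""
    return [
        (seg_id, profs[seg_id][0], profs[seg_id][1], studs[seg_id][1])
        for seg_id in profs
        if seg_id in studs
    ]


# Invented sample rows standing in for the real corpus files.
profs_sample = "r1\tGreat product.\tOdličan proizvod.\n"
studs_sample = "r1\tGreat product.\tSjajan proizvod.\n"

profs = read_parallel_tsv(io.StringIO(profs_sample))
studs = read_parallel_tsv(io.StringIO(studs_sample))
pairs = pair_translations(profs, studs)

for seg_id, source, prof_text, stud_text in pairs:
    print(seg_id, source, prof_text, stud_text, sep=" | ")
```

With the actual data, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8", newline="")`, for example `en-hr.news+reviews.profs.id+en+hr.tsv` and `en-hr.news+reviews.studs.id+en+hr.tsv`. If the files turn out to have a header row or a different column order, the reader needs to be adjusted accordingly.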

The corpus is hosted in the Fedora Commons repository of the Saarland University (UdS) CLARIN-D centre.

CITATION

Persistent identifier: http://hdl.handle.net/21.11119/0000-000A-1BA9-A

Reference: Lapshinova-Koltunski, Ekaterina, Maja Popović and Maarit Koponen. 2022. DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations. Submitted for LREC-2022.

LICENCE

The DiHuTra corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-nc-sa/4.0/