This repository contains the corpus data and results of the EAMT sponsorship 'Parallel Corpus to Analyse Differences between Human Translations', i.e. DiHuTra. Within this activity, a corpus of professional and student translations from English into Russian and Croatian is created.

katjakaterina/dihutra

DESCRIPTION
The DiHuTra corpus (Differences in Human Translations) was collected within activities supported by EAMT (EAMT sponsorship 2021). It is a corpus of human translations that contains both professional and student translations. The data consists of English source texts, namely news (from WMT2019 and WMT2020) and Amazon reviews, and their translations into Russian and Croatian, as well as a subcorpus containing translations of the Amazon review texts into Finnish. All target languages are mid-resourced and have been investigated little or only moderately. The same source texts were translated into the three target languages by two translator groups: students and professionals. We additionally include German professional translations of the news texts, which were available in the WMT dataset.

For the Amazon reviews subcorpus, each review in English was translated into the three target languages, Croatian, Russian and Finnish, by professionals and by students.

For the news subcorpus, Russian translations were already available from the WMT shared task and Croatian translations were produced for the purpose of this work. Finnish professional translations were not provided for the news articles.

In addition to the translations, information about the translators' age, gender, experience and (for students) study program was collected. Translators were asked to keep the sentence alignment, i.e. not to merge or split sentences, so that each English sentence corresponds to exactly one translated sentence (which is important for current MT systems), and not to use machine translation in the process of translation. No further restrictions were given to translators.
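The one-to-one sentence alignment described above can be checked before comparing translations. The following is a minimal sketch, assuming each text has already been loaded as a list with one sentence per entry; the function name `align_sentences` and the toy data are hypothetical and do not reflect the actual file layout of the corpus.

```python
from typing import List, Tuple


def align_sentences(source: List[str], translation: List[str]) -> List[Tuple[str, str]]:
    """Pair each source sentence with its translation, assuming 1:1 alignment."""
    if len(source) != len(translation):
        # A length mismatch means a sentence was merged or split somewhere.
        raise ValueError(
            f"alignment broken: {len(source)} source vs "
            f"{len(translation)} translated sentences"
        )
    return list(zip(source, translation))


# Hypothetical toy data (placeholders, not taken from the corpus).
source = ["The food was great.", "I would order again."]
professional = ["<professional translation 1>", "<professional translation 2>"]
student = ["<student translation 1>", "<student translation 2>"]

pairs_prof = align_sentences(source, professional)
pairs_stud = align_sentences(source, student)

# Because both groups translated the same source sentences, the two
# translations of sentence i can be compared directly.
for (src, prof), (_, stud) in zip(pairs_prof, pairs_stud):
    print(src, "|", prof, "|", stud)
```

Raising an error on a length mismatch, rather than silently truncating, makes a merged or split sentence visible before any comparison is run.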

The corpus is valuable for studying variation in translation, as it allows a direct comparison between human translations of the same source texts. The corpus is also a valuable resource for evaluating machine translation systems. We believe that this resource will facilitate understanding and improvement of the quality issues in both human and machine translation.

The corpus is hosted in the Fedora Commons repository of the Saarland University (UdS) CLARIN-D centre.

CITATION

Persistent identifier http://hdl.handle.net/21.11119/0000-000A-1BA9-A

Reference: Lapshinova-Koltunski, Ekaterina, Maja Popović and Maarit Koponen. 2022. DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations. Submitted for LREC-2022.

Licence

DiHuTra is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-nc-sa/4.0/
