This repository holds the corpus data and results of the EAMT-sponsored project 'Parallel Corpus to Analyse Differences between Human Translations', i.e. DiHuTra. Within this activity, a corpus of professional and student translations from English into Russian and Croatian is being created.
katjakaterina/dihutra
DESCRIPTION

The DiHuTra corpus (Differences in Human Translations) was collected within activities supported by EAMT (EAMT sponsorship 2021). It is a corpus of human translations that contains both professional and student translations. The data consists of English source texts, namely news (from WMT2019 and WMT2020) and Amazon reviews, and their translations into Russian and Croatian, as well as a subcorpus containing translations of the Amazon review texts into Finnish. All target languages are mid-resourced and have been investigated little or only moderately.

The same source texts were translated into the three target languages by two translator groups: students and professionals. We additionally include German professional translations of the news texts, which were available in the WMT dataset. For the Amazon reviews subcorpus, each English review was translated into the three target languages (Croatian, Russian and Finnish) by professionals and by students. For the news subcorpus, Russian translations were already available from the WMT shared task, and Croatian translations were produced for the purpose of this work. Finnish professional translations were not provided for the news articles.

In addition to the translations, information about the translators' age, gender, experience and study program (for students) was collected. Translators were asked to keep the sentence alignment (that is, not to merge or split sentences, so that each English sentence corresponds to exactly one translated sentence, which is important for current MT systems) and not to use machine translation during the translation process. No further restrictions were given to the translators.

The corpus is valuable for studying variation in translation, as it allows a direct comparison between human translations of the same source texts. The corpus is also a valuable resource for evaluating machine translation systems.
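Because each English sentence is meant to correspond to exactly one translated sentence, a source text and any of its translations can be paired line by line. The following is a minimal sketch of such a check, assuming the texts are available as plain strings with one sentence per line. That layout, the function name `align_sentences`, and the sample sentences are illustrative assumptions and are not taken from the corpus or its documentation.

```python
def align_sentences(source_text: str, target_text: str) -> list[tuple[str, str]]:
    """Pair source and target sentences line by line.

    Assumes one sentence per line (a hypothetical layout, not a
    documented DiHuTra file format). Raises ValueError when the two
    texts do not contain the same number of lines, which would mean
    that a sentence was merged or split during translation.
    """
    source_lines = source_text.splitlines()
    target_lines = target_text.splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"alignment broken: {len(source_lines)} source lines "
            f"vs {len(target_lines)} target lines"
        )
    return list(zip(source_lines, target_lines))


# Invented sample sentences, for illustration only (not corpus data).
english = "Great product.\nWould buy again.\n"
croatian = "Odličan proizvod.\nKupio bih ponovno.\n"

for src, tgt in align_sentences(english, croatian):
    print(f"{src}\t{tgt}")
```

The same pairing can be repeated for several translations of one source text, for example a student and a professional version, so that the variants of each sentence can be compared side by side.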
We believe that this resource will facilitate the understanding and improvement of quality issues in both human and machine translation. The corpus is hosted by the Fedora Commons Repository of the Saarland University (UdS) CLARIN-D centre.

CITATION

Persistent identifier: http://hdl.handle.net/21.11119/0000-000A-1BA9-A

Reference: Lapshinova-Koltunski, Ekaterina, Maja Popović and Maarit Koponen. 2022. DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations. Submitted for LREC-2022.

LICENCE

DiHuTra is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-nc-sa/4.0/