ParIce is an English-Icelandic parallel corpus for training machine translation (MT) systems. It consists of several subcorpora, some compiled from scratch and some collected from the internet and then realigned and filtered, as described in Barkarson and Steingrímsson (2019), "Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus". In total it contains over 3.5 million parallel translation segments.
Subcorpus | Segment Pairs |
---|---|
The Bible | 32,964 |
Books | 12,416 |
EEA documents | 1,701,172 |
EMA | 404,333 |
ESO | 12,633 |
KDE4 | 49,912 |
OpenSubtitles | 1,305,827 |
Sagas | 17,597 |
Statistics Iceland | 2,288 |
Tatoeba | 8,263 |
Ubuntu | 10,572 |
Further information is available in the paper.
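The figure of over 3.5 million segments can be checked against the table. The sketch below copies the counts from the table above and sums them. The names `SEGMENT_PAIRS` and `total_segment_pairs` are illustrative and not part of any ParIce tooling.

```python
# Segment-pair counts copied from the subcorpus table above.
SEGMENT_PAIRS = {
    "The Bible": 32_964,
    "Books": 12_416,
    "EEA documents": 1_701_172,
    "EMA": 404_333,
    "ESO": 12_633,
    "KDE4": 49_912,
    "OpenSubtitles": 1_305_827,
    "Sagas": 17_597,
    "Statistics Iceland": 2_288,
    "Tatoeba": 8_263,
    "Ubuntu": 10_572,
}


def total_segment_pairs(counts):
    """Return the total number of segment pairs across all subcorpora."""
    return sum(counts.values())


def share(counts, name):
    """Return the fraction of all segment pairs contributed by one subcorpus."""
    return counts[name] / total_segment_pairs(counts)


total = total_segment_pairs(SEGMENT_PAIRS)
print(f"Total segment pairs: {total:,}")

# List subcorpora from largest to smallest with their share of the total.
for name, n in sorted(SEGMENT_PAIRS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {n:>10,} {share(SEGMENT_PAIRS, name):6.1%}")
```

The counts sum to 3,557,977 segment pairs. EEA documents and OpenSubtitles together make up roughly 85% of the corpus, which is worth keeping in mind when judging how well a model trained on ParIce will cover other domains.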
The corpus is published under a CC BY 4.0 license and is available for download here.
A version of the corpus, filtered as described in Jónsson et al., 2020, is available for download here. If you use this version, please consider citing the paper.
A dev/test/train split of the corpus was created and used in Jónsson et al., 2020. The dev and test sets, 12,260 translation segments in total, were manually checked to some extent. The splits are available for download here.
The corpus can be searched in an online KWIC (keyword-in-context) tool, available here.
The following people have worked on the corpus:
Starkaður Barkarson, compilation, alignment and filtering
Steinþór Steingrímsson, compilation, alignment and filtering
Þórður Arnar Árnason, evaluating dev/test sets
Þórdís Dröfn Andrésdóttir, evaluating dev/test sets
The first version of the corpus: Download
The corpus filtered as described in Jónsson et al., 2020: Download
OPUS contains a large amount of parallel data, including 8.1 million English-Icelandic sentences. Note that the alignment and filtering of the English-Icelandic data on OPUS is not always ideal.
EN-IS Synthetic Parallel Corpus contains approx. 76 million back-translated sentences: 45 million translated from English to Icelandic and 31 million translated from Icelandic to English.
En-Is Semi-Synthetic Parallel Name Robustness Corpus contains approx. 38K sentence pairs in which person names have been identified in both the source and the target sentence and replaced with other names of the same gender and declension. This can help a network recognize more names than it otherwise would.
UD Icelandic PUD contains 1,000 sentences in Icelandic and English, with the Icelandic translated from English. These can possibly be used as test data.
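The name substitution described for the Name Robustness Corpus can be illustrated with a toy sketch. This is not the method used to build that corpus. It assumes that the name in a pair is already known and that a hand-written table of case forms (nominative, accusative, dative, genitive) is available. The table `DECLENSIONS` and the function `swap_name` are hypothetical names introduced only for this example.

```python
import re

# Hypothetical, hand-written case forms: nominative, accusative, dative, genitive.
# Both names are feminine and follow the same declension pattern.
DECLENSIONS = {
    "Guðrún": ("Guðrún", "Guðrúnu", "Guðrúnu", "Guðrúnar"),
    "Sigrún": ("Sigrún", "Sigrúnu", "Sigrúnu", "Sigrúnar"),
}


def swap_name(en_sentence, is_sentence, old, new):
    """Replace the name `old` with `new` on both sides of a sentence pair.

    The English side uses only the base form. On the Icelandic side each
    inflected form of `old` is mapped to the same case form of `new`.
    """
    # English: names are not inflected for case, so replace the base form.
    en_out = re.sub(rf"\b{re.escape(old)}\b", new, en_sentence)

    # Icelandic: map every case form of the old name to the matching form
    # of the new name. Accusative and dative coincide here, which is safe
    # because both names share the same declension pattern.
    mapping = dict(zip(DECLENSIONS[old], DECLENSIONS[new]))
    forms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b")
    is_out = pattern.sub(lambda m: mapping[m.group(1)], is_sentence)

    return en_out, is_out


pair = swap_name("I saw Guðrún yesterday.", "Ég sá Guðrúnu í gær.", "Guðrún", "Sigrún")
print(pair)
```

Requiring the replacement name to share the declension pattern is what keeps the Icelandic side grammatical: each case form has a direct counterpart, so the substitution needs no analysis of the sentence's syntax.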
If you use the ParIce data in your published research, please cite this paper:
@inproceedings{barkarson-steingrimsson-2019-compiling,
title = "Compiling and Filtering {P}ar{I}ce: An {E}nglish-{I}celandic Parallel Corpus",
author = "Barkarson, Starka{\dh}ur and Steingr{\'\i}msson, Stein{\th}{\'o}r",
booktitle = "Proceedings of the 22nd Nordic Conference on Computational Linguistics",
year = "2019",
address = "Turku, Finland",
publisher = {Link{\"o}ping University Electronic Press},
url = "https://www.aclweb.org/anthology/W19-6115",
pages = "140--145",
}
If you use the filtered dataset from Jónsson et al., 2020, please consider also citing this paper:
@inproceedings{DBLP:conf/tsd/JonssonSSSL20,
author = {Haukur P{\'{a}}ll J{\'{o}}nsson and Haukur Barri S{\'{\i}}monarson and V{\'{e}}steinn Sn{\ae}bjarnarson and Stein{\th}{\'{o}}r Steingr{\'{\i}}msson and Hrafn Loftsson},
editor = {Petr Sojka and Ivan Kopecek and Karel Pala and Ales Hor{\'{a}}k},
title = {Experimenting with Different Machine Translation Models in Medium-Resource Settings},
booktitle = {Text, Speech, and Dialogue - 23rd International Conference, {TSD} 2020, Brno, Czech Republic, September 8-11, 2020, Proceedings},
series = {Lecture Notes in Computer Science},
volume = {12284},
pages = {95--103},
publisher = {Springer},
year = {2020},
doi = {10.1007/978-3-030-58323-1\_10},
}