ParIce

ParIce is an English-Icelandic parallel corpus for training machine translation (MT) systems. It consists of several subcorpora, some compiled from scratch and some collected from the internet and then realigned and filtered, as described in Barkarson and Steingrímsson (2019), "Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus". In total, it contains over 3.5 million parallel translation segments.

Subcorpus             Segment pairs
The Bible                    32,964
Books                        12,416
EEA documents             1,701,172
EMA                         404,333
ESO                          12,633
KDE4                         49,912
OpenSubtitles             1,305,827
Sagas                        17,597
Statistics Iceland            2,288
Tatoeba                       8,263
Ubuntu                       10,572
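
As a quick sanity check on the "over 3.5 million" figure, the counts in the table can be summed directly. The sketch below only restates the numbers listed above; the function names `total_pairs` and `shares` are illustrative and not part of any ParIce tooling.

```python
# Segment-pair counts copied from the subcorpus table above.
SEGMENT_PAIRS = {
    "The Bible": 32_964,
    "Books": 12_416,
    "EEA documents": 1_701_172,
    "EMA": 404_333,
    "ESO": 12_633,
    "KDE4": 49_912,
    "OpenSubtitles": 1_305_827,
    "Sagas": 17_597,
    "Statistics Iceland": 2_288,
    "Tatoeba": 8_263,
    "Ubuntu": 10_572,
}


def total_pairs(counts):
    """Sum the segment pairs over all subcorpora."""
    return sum(counts.values())


def shares(counts):
    """Return each subcorpus's fraction of the whole corpus."""
    total = total_pairs(counts)
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"Total segment pairs: {total_pairs(SEGMENT_PAIRS):,}")
    # List subcorpora from largest to smallest share.
    for name, share in sorted(shares(SEGMENT_PAIRS).items(), key=lambda kv: -kv[1]):
        print(f"{name:<20}{share:6.1%}")
```

The total comes to 3,557,977 segment pairs, and the two largest subcorpora (EEA documents and OpenSubtitles) together account for roughly 85% of them, which may be worth keeping in mind when sampling training data by domain.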

Further information is available in the paper.

The corpus is published under a CC BY 4.0 license and is available for download here.

A version of the corpus, filtered as described in Jónsson et al., 2020, is available for download here. If you use that version, please consider citing the paper.

Dev/Test/Train

A dev/test/train split of the corpus was created and used in Jónsson et al., 2020. The dev and test sets, 12,260 translation segments in total, were manually checked to some extent. The splits are available for download here.

KWIC

The corpus can be searched with an online KWIC (keyword-in-context) tool, available here.
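
For readers unfamiliar with KWIC output, the following is a minimal sketch of what such a search produces: each hit is shown with a window of text to its left and right. It is not the online tool's implementation. The function `kwic`, its `width` parameter, and the sample sentences are assumptions made for illustration, and matching here is a plain case-insensitive substring search rather than a token-level query.

```python
import re


def kwic(segments, keyword, width=30):
    """Return (left context, match, right context) for every hit of keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    rows = []
    for segment in segments:
        for match in pattern.finditer(segment):
            # Take up to `width` characters on each side of the hit.
            left = segment[max(0, match.start() - width):match.start()]
            right = segment[match.end():match.end() + width]
            rows.append((left, match.group(0), right))
    return rows


if __name__ == "__main__":
    # Hypothetical sample segments, not taken from the corpus.
    sample = [
        "The corpus contains parallel segments.",
        "Each segment in the corpus has a translation.",
    ]
    for left, hit, right in kwic(sample, "corpus", width=20):
        # Right-align the left context so the hits line up in one column.
        print(f"{left:>20} [{hit}] {right}")
```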

People

The following people have worked on the corpus:

Starkaður Barkarson, compilation, alignment and filtering
Steinþór Steingrímsson, compilation, alignment and filtering
Þórður Arnar Árnason, evaluating dev/test sets
Þórdís Dröfn Andrésdóttir, evaluating dev/test sets

Downloads

The first version of the corpus: Download

The corpus filtered as described in Jónsson et al., 2020: Download

Dev/test/train sets

Other English-Icelandic parallel data

OPUS contains a large amount of parallel data, of which 8.1 million sentences are English-Icelandic. Note that the alignment and filtering of the English-Icelandic data on OPUS are not always ideal.

EN-IS Synthetic Parallel Corpus contains approx. 76 million back-translated sentences: 45 million translated from English to Icelandic and 31 million translated from Icelandic to English.

En-Is Semi-Synthetic Parallel Name Robustness Corpus contains approx. 38K sentences in which person names have been identified in both the source and target sentences of each pair and replaced with other names of the same gender and the same declension. This can help a network recognize more names than it otherwise would.
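
The name-replacement idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build that corpus: the lexicon `PARADIGMS`, the function `swap_name`, and the example sentences are hypothetical. The two feminine names used here decline alike in Icelandic (nominative, accusative, dative, genitive), so each inflected form of one can be mapped to the same case of the other, while the English side uses the uninflected name.

```python
import re

# Hypothetical mini-lexicon: name -> (nominative, accusative, dative, genitive).
# Both names are feminine and follow the same declension pattern.
PARADIGMS = {
    "Guðrún": ("Guðrún", "Guðrúnu", "Guðrúnu", "Guðrúnar"),
    "Sigrún": ("Sigrún", "Sigrúnu", "Sigrúnu", "Sigrúnar"),
}


def swap_name(english, icelandic, old, new):
    """Replace `old` with `new` on both sides of a sentence pair."""
    # Map each case form of the old name to the same case of the new name.
    form_map = dict(zip(PARADIGMS[old], PARADIGMS[new]))
    # Try longer forms first; \b keeps matches to whole words.
    forms = sorted(form_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b")
    new_icelandic = pattern.sub(lambda m: form_map[m.group(1)], icelandic)
    # Assumption: the English side only ever uses the nominative form.
    new_english = re.sub(r"\b" + re.escape(old) + r"\b", new, english)
    return new_english, new_icelandic


if __name__ == "__main__":
    pair = ("I saw Guðrún yesterday.", "Ég sá Guðrúnu í gær.")
    print(swap_name(*pair, old="Guðrún", new="Sigrún"))
```

Restricting replacements to names with the same gender and declension means the rest of the Icelandic sentence (agreement, case government) stays grammatical without any further editing.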

UD Icelandic PUD contains 1,000 sentences in Icelandic and English (the Icelandic translated from English), which can possibly be used as test data.

Cite

If you use the ParIce data in your published research, please cite this paper:

    @inproceedings{barkarson-steingrimsson-2019-compiling,
    title = "Compiling and Filtering {P}ar{I}ce: An {E}nglish-{I}celandic Parallel Corpus",
    author = "Barkarson, Starka{\dh}ur and Steingr{\'\i}msson, Stein{\th}{\'o}r",
    booktitle = "Proceedings of the 22nd Nordic Conference on Computational Linguistics",
    year = "2019",
    address = "Turku, Finland",
    publisher = {Link{\"o}ping University Electronic Press},
    url = "https://www.aclweb.org/anthology/W19-6115",
    pages = "140--145",
    }

If you use the filtered dataset from Jónsson et al., 2020, please consider also citing this paper:

    @inproceedings{DBLP:conf/tsd/JonssonSSSL20,
    author = {Haukur P{\'{a}}ll J{\'{o}}nsson and Haukur Barri S{\'{\i}}monarson and V{\'{e}}steinn Sn{\ae}bjarnarson and Stein{\th}{\'{o}}r Steingr{\'{\i}}msson and Hrafn Loftsson},
    editor = {Petr Sojka and Ivan Kopecek and Karel Pala and Ales Hor{\'{a}}k},
    title = {Experimenting with Different Machine Translation Models in Medium-Resource Settings},
    booktitle = {Text, Speech, and Dialogue - 23rd International Conference, {TSD} 2020, Brno, Czech Republic, September 8-11, 2020, Proceedings},
    series = {Lecture Notes in Computer Science},
    volume = {12284},
    pages = {95--103},
    publisher = {Springer},
    year = {2020},
    doi = {10.1007/978-3-030-58323-1\_10},
    }