

Parallel corpora


Availability: 3 = existent but only company-internal; 2 = existent and freely usable for PreR&D; 1 = existent and freely usable for both PreR&D and R&D.
Price: 4 = more than 10,000 €; 3 = 1,000 - 10,000 €; 2 = 100 - 1,000 €; 1 = less than 100 € or free.
Manipulability: 3 = black box; 2 = glass box (you can see it but not change it); 1 = freely manipulable.

R means for research use; C means for commercial use.
When availability = 3 (company-internal), the other features are irrelevant.
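
The rating scheme above can be sketched as a small lookup, which may help when the table is processed programmatically. This is only an illustrative sketch: it assumes the three legend lines correspond, in order, to availability, price, and manipulability (the last column of the table), and the names `AVAILABILITY`, `PRICE`, `MANIPULABILITY`, and `describe_rating` are hypothetical, not part of any NEMLAR or MEDAR tool.

```python
# Legend codes as given in the text; the assignment of each legend line to
# availability / price / manipulability is an assumption based on the order
# of the last column header.
AVAILABILITY = {
    3: "existent but only company-internal",
    2: "existent and freely usable for PreR&D",
    1: "existent and freely usable for both PreR&D and R&D",
}
PRICE = {
    4: "more than 10,000 €",
    3: "1,000 - 10,000 €",
    2: "100 - 1,000 €",
    1: "less than 100 € or free",
}
MANIPULABILITY = {
    3: "black box",
    2: "glass box (you can see but not change it)",
    1: "freely manipulable",
}


def describe_rating(availability, price=None, manipulability=None):
    """Return the legend descriptions for a rating (hypothetical helper)."""
    parts = [AVAILABILITY[availability]]
    if availability == 3:
        # Company-internal resources: the other features are irrelevant.
        return parts
    if price is not None:
        parts.append(PRICE[price])
    if manipulability is not None:
        parts.append(MANIPULABILITY[manipulability])
    return parts
```

For example, `describe_rating(3, 4, 3)` returns only the availability description, since price and manipulability are ignored for company-internal resources.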

| Name of corpus | Provider | Size | Language | Other information | Availability, price, manip. |
|---|---|---|---|---|---|
| EGYPT Giza Toolkit Quran Parallel Corpus | CLSP/JHU | | ar-en | | Free |
| Sentence aligned bilingual Arabic English corpus | Sakhr | 1.35 million sentences | ar-en, en-ar | | 3 |
| CLARA (Corpus Linguae Arabicae) | Charles University | 37 million words | ar-cz | | |
| Bilingual aligned corpus | ILC | | ar-it | | |
| Arabic English Parallel News Part 1 (Ummah) | LDC | 2 million words | ar, en | Catalog no.: LDC2004T18 | $1500-3000 |
| Arabic News Translation Text Part 1 | LDC | 441,000 words | ar, en | Catalog no.: LDC2004T17 | $1500-3000 |
| Arabic Newswire English Translation Collection | LDC | 551,000 words | ar-en | Catalog no.: LDC2009T22 | $1500 |
| Multiple Translation Arabic - Part 1 | LDC | 23,000 words | ar, en | Catalog no.: LDC2003T18 | $500-1000 |
| Multiple Translation Arabic - Part 2 | LDC | 15,000 words | ar, en | Catalog no.: LDC2005T05 | $500-1000 |
| TDT4 Multilanguage Text and Annotation | LDC | | ar, en, ch | Catalog no.: LDC2005T16 | $200-2000 |
| TDT5 Multilanguage Text | LDC | | ar, en, ch | Catalog no.: LDC2006T18 | $200-750 |
| GALE Phase 1 Arabic Blog Parallel Text | LDC | unknown | ar, en | Catalog no.: LDC2008T02 | $1500 |
| GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | LDC | 90,000 words | ar, en | Catalog no.: LDC2007T24 | $1500 |
| GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 | LDC | 56,000 words | ar, en | Catalog no.: LDC2008T09 | $1500 |
| GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | LDC | unknown | ar, en | Catalog no.: LDC2009T03 | $1500 |
| ISI Arabic-English Automatically Extracted Parallel Text | LDC | 1.1 million sentence pairs | ar, en | Catalog no.: LDC2007T08 | $2000-4000 |
| Holy Quran book | islamware | 78,500 words | ar, en, fr, de, es | | Free |
| E-A Parallel Corpus | University of Kuwait | 2 million words | en-ar | | |
| Multilingual Corpus | University of Manchester | 11.5 million words | ar, en | | |
| STRAND En-Ar Parallel web pages (tool and corpus) | University of Maryland | 2190 URL pairs | en-ar | | Free |
| Nijmegen Corpus | Nijmegen University | 2 million words | ar-dutch | | €130? |
| OPUS KDE Open source products' manuals | OPUS e.g. EuroMatrix | 300,000 tokens | af, ar, az, be, bg, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fi, fr, ga, gl, he, hr, hu, id, is, it, ja, ko, ku, lt, lv, mi, mk, mt, nb, nl, nn, oc, pl, pt, ro, ru, sk, sl, sr, sv, ta, th, tr, uk, ven, vi, wa, xh, zu | | Free |
| United Nations General Assembly Resolutions | Alexandre Rafalovitch, Robert Dale | Ar: 2,721,463 words | ar, en, fr, sp, ru, ch | | Free for research purposes |
| Meedan Translation Memory | Meedan | 20 K sentence pairs | ar, en | | Open Database License (ODbL) |
| Microsoft Terminology | Microsoft | 12 K expressions | ar, en, fr, gr, ch | | |
| Arcade II - Evaluation Package | ELRA | 316,000 words | ar, fr | Le Monde Diplomatique aligned sentences | €150-1000 |
| CESTA Evaluation Package | ELRA | 60,000 words | ar, fr | The two corpora from Le Monde Diplomatique and from the UNICEF, WHO and FHI websites, translated from 1 to 4 times | €150-1000 |

MEDAR is supported by the European Commission's ICT programme and is running from
February 1st 2008 until July 31st 2010
