




Contact:
nemlar@hum.ku.dk


Unannotated corpora


Availability:
3 = existent but only company-internal; 2 = existent and freely usable for PreR&D; 1 = existent and freely usable for both PreR&D and R&D.
Cost:
4 = more than 10,000 €; 3 = 1,000 - 10,000 €; 2 = 100 - 1,000 €; 1 = less than 100 € or free.
Adaptability:
3 = black box; 2 = glass box (you can see it but not change it); 1 = freely manipulable.

R means for research use; C means for commercial use.
For availability = 3 (company-internal), the other features are irrelevant.
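
The rating codes used in the table (for example "1,2,1 R 1,3,1 C") can be decoded mechanically. The following is a minimal sketch, assuming each triple is ordered availability, cost, adaptability, as the legend and the column header suggest. The function name `decode_ratings` and the dictionary names are hypothetical and are not part of the NEMLAR or MEDAR material.

```python
import re

# Legend values copied from the scales above.
AVAILABILITY = {
    1: "existent and freely usable for both PreR&D and R&D",
    2: "existent and freely usable for PreR&D",
    3: "existent but only company-internal",
}
COST = {
    1: "less than 100 € or free",
    2: "100 - 1,000 €",
    3: "1,000 - 10,000 €",
    4: "more than 10,000 €",
}
ADAPTABILITY = {
    1: "freely manipulable",
    2: "glass box",
    3: "black box",
}
USE = {"R": "research", "C": "commercial"}

# One triple such as "1,2,1", optionally followed by R or C.
_RATING = re.compile(r"(\d)\s*,\s*(\d)\s*,\s*(\d)(?:\s+([RC]))?")


def decode_ratings(cell):
    """Decode a rating cell into a list of dicts, one per triple.

    Assumption: triples are ordered availability, cost, adaptability.
    A bare "3" means company-internal, so the other features are ignored.
    """
    cell = cell.strip()
    if cell == "3":
        return [{
            "use": "unspecified",
            "availability": AVAILABILITY[3],
            "cost": None,
            "adaptability": None,
        }]
    decoded = []
    for avail, cost, adapt, use in _RATING.findall(cell):
        decoded.append({
            "use": USE.get(use, "unspecified"),
            "availability": AVAILABILITY[int(avail)],
            "cost": COST[int(cost)],
            "adaptability": ADAPTABILITY[int(adapt)],
        })
    return decoded


if __name__ == "__main__":
    for entry in decode_ratings("1,2,1 R 1,3,1 C"):
        print(entry)
```

Under this reading, a cell such as "1,2,1 R 1,3,1 C" yields one entry for research use and one for commercial use, differing only in the cost band.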

| Name of corpus | Provider | Size | Other information | Price and rating (availability, cost, adaptability) |
| Al-Hayat Arabic Corpus | ELRA | 18,639,264 tokens | The tokens cover 42,591 articles within 7 domains | Price: €480-1440; 1,2,1 R; 1,3,1 C |
| An-Nahar newspapers text corpus | ELRA | 24 million words | The words are found in 45,000 articles; Arabic from Lebanon | Price: €336-1008; 1,2,1 R; 1,3,1 C |
| Dinar-MBC | Lyon2 | 10 million words | Literature, essays, press | 3 |
| Fully diacritized/vowelized text corpus | RDI | 3 million words | Multi-domain balanced coverage: literature, business, science, sport, politics, etc. | 1,4,1 |
| Arabic morphologically analyzed, PoS-tagged and vowelized corpus | RDI | 750K words | Multi-domain balanced coverage: literature, business, science, sport, politics, etc. | 1,4,1 |
| Monolingual unannotated | Sakhr | 2 billion words | Classified on a coarse-grained subject tree | 3 |
| Fully diacritised monolingual Arabic corpus for Islamic domain | Sakhr | 80 million words | | 3 |
| Le Monde Diplomatique | ELRA | 75,000 – 480,000 words | | Price: €46-69 per year |
| AFP Corpus | ELRA | 450,000 documents | | Price: To be announced |
| NEMLAR Written Corpus | ELRA | 500,000 words | | Price: €150-2000 |
| ArabiCorpus | Brigham Young University | 1 million words | | |
| Arabic Wikipedia articles | UPV (Y. Benajiba) | 11,000 articles | | Free |
| Arabic Gigaword | LDC | 400 million words | | Price: $200-3000 |
| General Scientific Arabic Corpus | University of Manchester | 1.6 million words | | |
| Classical Arabic Corpus | University of Manchester | 5 million words | | |
| Buckwalter Arabic corpus | Tim Buckwalter | 5 million words | | |
| A Corpus of Contemporary Arabic (CCA Corpus) | University of Leeds (UK) | 1 million words | | Free to download |
| Arabic Newswire Corpus | LDC | 80 million words | | Price: $600-1200 |
| International Corpus of Arabic (ICA) | Bibliotheca Alexandrina, Egypt | 100 million words | | - |
| Khaleej-2004 corpus | Mourad Abbas | 3 million words, more than 5,000 articles | Articles taken from the online newspaper Akhbar Alkhaleej | Free for research use |
| Watan-2004 corpus | Mourad Abbas | About 20,000 articles | Articles taken from the online newspaper Omani 2004 | Free for research use |


MEDAR is supported by the European Commission's ICT programme and is running from February 1st 2008 until July 31st 2010.
