BLARK Specification for Arabic

The BLARK definition describes the types of resources that are needed, but it does not indicate the size or any other characteristic of each type of resource. We have examined the needs for Arabic and give our estimates below.
The figures are tentative, build on available experience, and may be changed if further work so suggests.

Written Resources

Monolingual lexicon:
For all components: 40,000 stems with POS, morphology
For sentence boundary detection: a list of conjunctions and other sentence starters/stoppers
For named entity recognition: tagged proper names; 50,000 proper names of persons are needed
For semantic analysis: same 40,000 as for all components, but also with subcategorisation, lexical semantic information (concrete-abstract, animate, domain etc.). A wordnet would be good.
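
As an illustration of how the two layers of lexical information listed above could sit in one record, here is a minimal sketch. The class name, the field names, and the example values are hypothetical and are not part of the BLARK specification; the point is only that the base layer (stem, POS, morphology) and the semantic layer (subcategorisation, lexical semantic features) can share one entry.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """One monolingual lexicon entry. All field names are hypothetical."""
    stem: str
    pos: str
    # Base layer needed by all components.
    morphology: dict = field(default_factory=dict)
    # Extra layer needed for semantic analysis.
    subcategorisation: list = field(default_factory=list)
    semantics: dict = field(default_factory=dict)

    def has_semantic_layer(self):
        """True if the entry carries lexical semantic information."""
        return bool(self.semantics)


# Example entry: the noun "book", with root k-t-b.
entry = LexiconEntry(
    stem="كتاب",
    pos="noun",
    morphology={"root": "ك-ت-ب", "gender": "masculine"},
    semantics={"concrete": True, "animate": False},
)
```

An entry without the `semantics` field filled in would still serve POS tagging and morphology, which is why the same 40,000 stems can be reused and enriched rather than collected twice.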

Multi-, bilingual lexicon:
Same size as monolingual lexicon, depending on application

Thesauri, ontologies, wordnets:
Thesauri: Subject tree with 200-300 nodes for each domain
Ontologies and wordnets should ideally be the same size as the lexicon
Terminological databases: Size will depend on the domain.

Unannotated corpora:
For term extraction: 100 mill. words

Annotated corpora:
A minimum of 0.5 mill. may be used for a few applications
POS tagger, statistics based: 1-3 mill.
Sentence boundary: 0.5-1.5 mill.
Named entity, statistics based: 1.5 mill.
Term extraction: 100 mill.
Co-reference resolution: 1 mill.
Word sense disambiguation: 2-3 mill.

Summing up, it seems that an annotated corpus of 2 mill. should meet most requirements.
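
To make the summary figure easier to check, the following sketch tabulates the per-task sizes listed above and classifies each task against a candidate corpus size. The dictionary and function names are illustrative assumptions; the numbers are copied from the list, with single figures stored as a range whose two ends are equal.

```python
# Annotated-corpus sizes per task, in millions of words, copied from the list above.
# Each value is a (low, high) range; single figures are stored as (x, x).
REQUIREMENTS_MILL = {
    "POS tagger, statistics based": (1.0, 3.0),
    "Sentence boundary": (0.5, 1.5),
    "Named entity, statistics based": (1.5, 1.5),
    "Term extraction": (100.0, 100.0),
    "Co-reference resolution": (1.0, 1.0),
    "Word sense disambiguation": (2.0, 3.0),
}


def coverage(corpus_mill):
    """Classify each task as 'full', 'minimum', or 'not met' for a corpus size."""
    result = {}
    for task, (low, high) in REQUIREMENTS_MILL.items():
        if corpus_mill >= high:
            result[task] = "full"
        elif corpus_mill >= low:
            result[task] = "minimum"
        else:
            result[task] = "not met"
    return result


if __name__ == "__main__":
    for task, status in coverage(2.0).items():
        print(f"{task}: {status}")
```

Under these figures, a 2 mill. corpus reaches the upper end of the range for sentence boundary detection, named entities, and co-reference resolution, reaches only the lower end for POS tagging and word sense disambiguation, and does not cover term extraction, which is listed at 100 mill.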

Parallel multilingual corpora:
Alignment: 0.5 mill. tagged corpus

Multimodal corpora for hand OCR:
Grapheme recognition:
Specifications for this will follow in an updated version of the document.

Multimodal corpora for typed OCR:
Grapheme recognition:
Specifications for this will follow in an updated version of the document.

Spoken Resources

Acoustic Data - The audio data required for:

About 50-100 speakers x 20 min each, transcribed and fully vowelized, plus 10 speakers for testing. It should be made available with a written corpus of a few mill. words and a phonetic lexicon (the size of which depends on the language model) derived from a vowelized text (see written corpus below).
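
The speaker and duration figures above translate into total recording time as follows. This is a rough sketch; the function name is illustrative, and the assumption that the 10 test speakers also record 20 minutes each is ours, since the text does not state their duration.

```python
def total_hours(speakers, minutes_per_speaker):
    """Total recording time in hours for a group of speakers."""
    return speakers * minutes_per_speaker / 60


low = total_hours(50, 20)       # lower bound of the training set, about 16.7 hours
high = total_hours(100, 20)     # upper bound of the training set, about 33.3 hours
held_out = total_hours(10, 20)  # ASSUMPTION: test speakers also record 20 min each

if __name__ == "__main__":
    print(f"training: {low:.1f} to {high:.1f} hours, testing: {held_out:.1f} hours")
```

The training set therefore amounts to roughly 17 to 33 hours of transcribed speech.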

Telephony speech applications:
Requires about 500 speakers uttering around 50 different sentences and other items (SpeechDat family, like Orientel, UOB project). It should cover Modern Colloquial Arabic, “middle Arabic”, MSA (Modern Standard Arabic), and Fr/Eng. Conditions as for SpeechDat resources, including a phonetic lexicon in SAMPA (with emphasis on digits, proper names, cities, companies, named entities, …).

Embedded speech recognition:
One may use desktop data (dictation), but data similar to Speecon (see details for the acoustic conditions, set of 3-4 microphones, etc.) is preferable.

Transcription of broadcast News:
(BNSC: Broadcast News Speech Corpus.) Transcribed audio data: about 50 to 100 hours of well-annotated speech (at the orthographic level); about 1000 hours of non-transcribed data is also useful. It should come with a written corpus for language models (from newspapers, press releases, and transcriptions) of about 300 mill. words of unannotated text (partly vowelized), a lexicon (like the previous ones), and a lexicon of proper names with updating mechanisms from newspapers and media.

Transcription of conversational speech:
Data similar to CallHome / CallFriend from LDC (which covers mainly Egyptian Arabic), which may be extended with other varieties of Arabic (Maghrebian, Levantine, etc.).

Speaker recognition:
An audio corpus of about 500 speakers for training (labelled with speaker ID and also with orthographic transcriptions), uttering about 3 min of speech per speaker. It also requires about 100 speakers for testing (amount of speech 0.5 min, incl. impostors, …).

Dialect / language identification:
Data similar to LDC/NIST CALLFRIEND or extracted from Broadcast news speech transcripts; we may add a set of varieties of Arabic to extend the Egyptian variety at LDC.

Speech Synthesis Corpus:
(For text-to-speech, TTS.) Requires a male and a female professional speaker; 15 hours is optimal, but realistically 5 hours may be OK. The recordings are generated by reading a phonetically balanced text (in some applications one may need 10 speakers x 100 sentences).

Formant Synthesis/Parametric Corpus:
Same database as for Speech Synthesis above, with hand-labelled formants (min. half an hour).

Multimodal corpora for lip analysis and generation:
Lip movement reading: the corpus could be similar to M2VTS with some 50 faces (see details).
We anticipate that this would be a good candidate for the BLARK.

Written corpus for speech technologies:

Un-annotated corpus:
About 300 mill. words, preferably from BNSC or press and media sources.

Annotated corpus:
This may be useful for deriving a phonetic lexicon and language models; it may be the same as for written technologies (minimum between 1 and 5 mill., other sizes for specific applications).

Vowelized corpus and Non-vowelized corpus:
This is important only if there is no way to obtain a vowelization tool and/or a phonetic lexicon.

Phonetic Lexicon:
Phonetic lexicon: depends on the size of the language model and could be derived from a vowelized text; may be the same size as for written technologies, but fully vowelized.
A specific phonetic lexicon emphasising digits, proper names, cities, companies, named entities, …
Lexicon of proper names (including foreign names and entities) with updating mechanisms from newspapers and media, about 50K if used in conjunction with named entities.

MEDAR is supported by the European Commission's ICT programme and is running from February 1st 2008 until July 31st 2010.
