Welcome to the NEMLAR newsletter. We bring you the latest news on language resources and language technologies for Arabic in Europe and the Southern Mediterranean countries, and keep you informed about upcoming events and other useful information.

A version of this newsletter is also available from: http://www.nemlar.org/Newsletter

To subscribe, please send an email to: nemlar@cst.dk

If you find this newsletter useful and informative, feel free to forward it to others. The newsletter will appear approximately four times per year. Please send any feedback you may have to: nemlar@cst.dk

NEMLAR was a European Commission supported initiative from 2003 to 2005. It was dedicated to surveying the state of the art of language resources for Arabic and the needs for such resources, to providing a BLARK specification for Arabic, and to promoting the development of Arabic language resources in Europe and the Southern Mediterranean countries. Some of the work from the project will be continued: the project web site, the newsletter, and updates of the survey results and the BLARK.

---------------------------------------------------------------------

Newsletter Content
------------------
1. NEMLAR publications
2. The NEMLAR Arabic Language Resources
3. New language resources, books, papers, software and journals
4. Upcoming Events
5. Links

1. NEMLAR publications
***********************
• Report on Basic Language Resource Kit (BLARK) for Arabic. For more information see http://www.nemlar.org/Publications/BLARK-final.pdf
• Language Technology for Arabic. For more information see http://www.nemlar.org/Publications/Arabic_LT.pdf

2. The NEMLAR Arabic Language Resources
***************************************
The NEMLAR Arabic LRs comprise a set of three resources: the NEMLAR Arabic Written Corpus, the NEMLAR Arabic Broadcast News Speech Corpus and the NEMLAR Arabic Speech synthesis Corpus.
These resources are owned and copyrighted by the NEMLAR Consortium and will be available through ELRA (http://www.elra.info) in May 2006.

The NEMLAR Arabic Written Corpus consists of 500K words of Standard Arabic text compiled from 13 different domains (political news, political debate, Islamic text, common-word phrases, text from Broadcast News, business, Arabic literature, general news, interviews, scientific press, sports press, dictionary-entry explanations and legal domain text). The aim is a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The data span the period from the late 1990s to 2005. The corpus is provided in four different versions: a) raw text, b) fully vowelized text, c) text with Arabic lexical analysis, and d) Arabic POS-tagged text.

The NEMLAR Arabic Broadcast News Speech Corpus consists of 40 hours of transcribed Standard Arabic data (from 209 male and 50 female speakers) recorded from four different radio stations. Each daily-broadcast recording contains between 25 and 30 minutes of news and interviews. Transcriptions follow the Transcriber conventions with the additional patch for Arabic. Thus, transcriptions were done in Arabic characters and their transliterations were generated automatically. The character set used for the transliterations follows the ISO-8859 standard. The annotation covers the orthographic transcription of speech, including named entities; speakers and speaker turns; segment markers; topic/story boundaries; background noises; change of background; music/noise, and word boundaries.

The NEMLAR Arabic Speech synthesis Corpus was produced to help build concatenative and parametric Arabic TTS systems. This corpus consists of 10 hours of annotated recorded speech from native Arabic speakers (5 hours from a male speaker and 5 hours from a female speaker).
All speech data was recorded at 96 kHz, 24 bits, 2 channels (one for a highly sensitive large-membrane microphone, and the other for the electroglottograph (EGG) signal). The prompt sheets were the same for the male and female recordings. They contained 33,200 words, distributed as follows: a) 6,600 were extracted from different domains of the NEMLAR Arabic Broadcast News Speech Corpus; b) 16,500 were selected from different domains of the NEMLAR Arabic Written Corpus; c) 3,500 represented frequent Arabic phrases, and d) the remaining 6,600 aimed to cover missing and rare diphones. The full corpus comprises the following components: orthographic transcription, prosodic transcription, phonetic transcription, phonetic segmentation and pitch marks.

3. New language resources, books, papers, software and journals
***************************************************************
Books:
• Current Issues in Linguistic Theory. Editor: Sami Boudelaa, University of Cambridge. For more information see http://www.benjamins.com/cgi-bin/t_bookview.cgi?bookid=CILT%20266
• The Arabic Linguistic Tradition by Georges Bohas, Jean-Patrick Guillaume, Djamel Kouloughli. For more information see http://press.georgetown.edu/detail.html?id=158901085x
• The Arabic Language Today by A. F. L. Beeston. For more information see http://press.georgetown.edu/detail.html?id=1589010841
• Consonance in the Qur'an - A Conceptual, Intertextual and Linguistic Analysis by Hussein Abdul-Raof. For more information see https://ssl.kundenserver.de/www.s83009615.einsundeinsshop.de/sess/utn;jsessionid=1543e74bd719806/shopdata/index.shopscript
• A Student Grammar of Modern Standard Arabic from Cambridge University Press. For more information see http://linguistlist.org/issues/16/16-407.html

Language resources:
• The GlobalPhone Database for Arabic, catalogue number S0192. For more information see http://catalog.elda.org/index.php?language=en
• A Short Reference Grammar of Moroccan Arabic.
For more information see http://linguistlist.org/issues/15/15-1365.html

Weblog:
• Prague Arabic Dependency Treebank 1.0. For more information see http://ufal.mff.cuni.cz/padt/online

Journals:
• Languages and Linguistics, new issue. For more information see http://www.lang-ling.tk
• Journal of Arabic Linguistics Tradition (JALT). For more information see http://www.jalt.net

Visit the NEMLAR web site for more information: http://www.nemlar.org

4. Upcoming Events
********************
• LREC 2006, Magazzini del Cotone Conference Center, May 22-28 2006, Genoa, Italy. For more information see http://www.lrec-conf.org
• JETALA, 5-6 June 2006, IERA, Rabat, Morocco. For more information see http://www.nemlar.org/Events/JETALA.txt
• Conference on Communication and Information Structure in Spoken Arabic, June 8-10, 2006, University of Maryland, Stamp Student Union, Benjamin Banneker Room. For more information see http://www.nemlar.org/Events/Conf_Communication.txt
• 3rd European Semantic Web Conference (ESWC 2006), 11 - 14 June 2006, Budva, Montenegro. For more information see http://www.eswc2006.org/
• ISDD 06 - International Symposium: DISCOURSE and DOCUMENT, 15-17 June 2006, Caen, France. For more information see http://discours2006.info.unicaen.fr/
• EAMT 11th Annual Conference, 19-20 June 2006, Oslo, Norway. For more information see http://eamt.emmtee.net/
• TC-STAR Workshop on Speech-to-Speech Translation, June 19-21 2006, Barcelona, Spain. For more information see http://www.elda.org/tcstar-workshop/
• Coling/ACL2006, 17th - 21st July 2006, Sydney, Australia. For more information see http://www.acl2006.mq.edu.au
• ESSLLI 2006 - 18th European Summer School in Logic, Language and Information, 31 July - 11 August, 2006, Malaga, Spain.
For more information see http://esslli2006.lcc.uma.es/give-page.php
• AMTA 2006, August 8-12, 2006, Boston Marriott, Cambridge, Massachusetts, USA. For more information see http://amta2006.amtaweb.org
• FinTAL, 5th International Conference on Natural Language Processing, 23-25 August 2006, Turku, Finland. For more information see http://www.it.utu.fi/fintal
• Translating and the Computer 28: Conference, 16-17 November 2006, London. For more information see http://www.aslib.com/conferences
• TAPA 2006: Treebanking and Advanced Processing of Arabic, November 30, 2006, Charles University in Prague, Czech Republic. For more information see http://ufal.mff.cuni.cz/padt/TAPA2006
• Sixth Conference on Language Engineering, 6 - 7 December 2006. For more information see http://www.asueng.eun.eg/else
• IEEE First International Conference on Digital Information Management, December 6 - 8 2006, Christ College, Bangalore, India. For more information see http://www.icdim.org
• “Beyond the Orient: The Research Challenges Ahead”, 21st International Conference on Computer Processing of Oriental Languages (ICCPOL2006), December 17-19, 2006, Singapore. For more information see http://www.iccpol-06.org

Visit the NEMLAR web site for more information: http://www.nemlar.org/Events

5. Links
********
• ELRA distributes Arabic language resources: http://www.elra.info
• Linguistic Data Consortium (LDC) distributes Arabic language resources: http://www.ldc.upenn.edu
• Link to Arabic NLP technologies at RDI (to be found under the submenu item 'Arabic NLP' under the main menu item 'Technologies'): http://www.RDI-eg.com
• The Faharis Site, list of Arabic web resources: http://www.faharis.net
• Latifa Al-Sulaiti's homepage with collections of tools and resources for Arabic: http://www.comp.leeds.ac.uk/latifa/survey.htm
• Link to free morphological analyzers for the Arabic language: http://www.glue.umd.edu/~kareem/research/
• Visit the Linguist List related to Arabic language: http://listserv.linguistlist.org/archives/arabic-l.html
• List of pointers to Arabic and other Semitic NLP and Speech sites: http://www.elsnet.org/arabiclist.html
• Lists of websites with theses dealing with Arabic human language technologies:
  http://www.biomath.jussieu.fr/ATALA/these/#Idx3
  http://www.technolangue.net/rubrique.php3?id_rubrique=11
• Links to Arabic processing tools at the University of Leeds: http://www.comp.leeds.ac.uk/latifa/survey.htm
• Links to Arabic tools, resources, conferences, etc: http://www.mghamdi.com/links.htm
• Link to Free Lane's Arabic-English Lexicon: http://www.qaiu.org/archives/quran/arabic_lexicon_projects/index.html
• Link to Arabic Language Home: http://www.arabic-language.org

----------------------------------------------------------------------

This newsletter is published by the NEMLAR project (http://www.nemlar.org) and produced by Center for Sprogteknologi.

To contact the project co-ordinator:
Center for Sprogteknologi (CST), University of Copenhagen
Project Co-ordinator: Bente Maegaard
Tel: +45 35 32 90 74, Fax: +45 35 32 90 89
email: nemlar@cst.dk