
Creation of annotated country-level dialectal Arabic resources: An unsupervised approach

Published online by Cambridge University Press:  09 August 2021

Maha J. Althobaiti*
Affiliation:
Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
*Corresponding author. E-mail: [email protected]

Abstract

The wide use of multiple spoken Arabic dialects on social networking sites has stimulated increasing interest in Natural Language Processing (NLP) for dialectal Arabic (DA). Arabic dialects represent true linguistic diversity and differ from Modern Standard Arabic (MSA). In fact, the complexity and variety of these dialects mean that a single NLP system cannot suit all of them. In comparison with MSA, the available datasets for the various dialects are generally limited in size, genre and scope. In this article, we present a novel approach that automatically develops an annotated country-level dialectal Arabic corpus and builds lists of words that encompass 15 Arabic dialects. The algorithm uses an iterative procedure consisting of two main components: automatic creation of lists of dialectal words and automatic creation of an annotated Arabic dialect identification corpus. To our knowledge, our study is the first of its kind to examine and analyse the poor performance of the MSA part-of-speech tagger on dialectal Arabic content and to exploit that poor performance in order to extract dialectal words. The pointwise mutual information association measure and the geographical frequency of word occurrence online are used to classify dialectal words. The annotated dialectal Arabic corpus (Twt15DA), built using our algorithm, is collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. We randomly selected a sample of 75 tweets per country, 1125 tweets in total, and had native speakers perform a manual dialect identification task. The results show an average inter-annotator agreement score of 64%, which reflects satisfactory agreement considering the overlapping features of the 15 Arabic dialects.
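As an illustration of the pointwise mutual information measure mentioned above, the following minimal sketch scores how strongly a word is associated with a country, using the standard definition PMI(w, c) = log2(p(w, c) / (p(w) p(c))). It is not the article's implementation: the function name `pmi_by_country`, the input layout and the toy tokens are assumptions made for illustration, and the geographical frequency of word occurrence online is not modelled here.

```python
import math
from collections import Counter


def pmi_by_country(tweets):
    """Return {(word, country): PMI} from (country, tokens) pairs.

    Hypothetical helper for illustration only; probabilities are
    estimated from raw token counts with no smoothing.
    """
    joint = Counter()          # count of each (word, country) pair
    word_totals = Counter()    # count of each word over all countries
    country_totals = Counter() # count of tokens seen per country
    total = 0                  # total number of tokens

    for country, tokens in tweets:
        for token in tokens:
            joint[(token, country)] += 1
            word_totals[token] += 1
            country_totals[country] += 1
            total += 1

    scores = {}
    for (word, country), n_wc in joint.items():
        # log2( p(w,c) / (p(w) * p(c)) ), written with counts
        ratio = n_wc * total / (word_totals[word] * country_totals[country])
        scores[(word, country)] = math.log2(ratio)
    return scores


# Toy input with made-up tokens: "w1" occurs only in EG tweets,
# "shared" occurs equally often in EG and SA tweets.
toy = [("EG", ["w1", "shared"]), ("SA", ["w2", "shared"])]
scores = pmi_by_country(toy)
print(scores[("w1", "EG")])      # positive: tied to one country
print(scores[("shared", "EG")])  # zero: no country preference
```

A word that occurs in only one country's tweets receives a positive score, while a word spread evenly across countries scores near zero, which is the intuition behind using such a measure to separate country-specific dialectal words from shared vocabulary.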

Type
Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press

