
DarijaBanking: A new resource for overcoming language barriers in banking intent detection for Moroccan Arabic speakers

Published online by Cambridge University Press:  05 December 2024

Abderrahman Skiredj*
Affiliation:
OCP Solutions, Casablanca, Morocco UM6P College of Computing, Benguerir, Morocco
Ferdaous Azhari
Affiliation:
National Institute of Posts and Telecoms, Rabat, Morocco
Ismail Berrada
Affiliation:
UM6P College of Computing, Benguerir, Morocco
Saad Ezzini
Affiliation:
School of Computing and Communications, Lancaster University, Lancaster, UK
*
Corresponding author: Abderrahman Skiredj; Email: [email protected]

Abstract

Navigating the complexities of language diversity is a central challenge in developing robust natural language processing systems, especially in specialized domains like banking. The Moroccan Dialect of Arabic (Darija) serves as a common language that blends cultural complexities, historical impacts, and regional differences, which presents unique challenges for language models due to its divergence from Modern Standard Arabic and influence from French, Spanish, and Tamazight. To tackle these challenges, this paper introduces DarijaBanking, a novel Darija dataset aimed at enhancing intent classification in the banking domain. DarijaBanking comprises over 1800 parallel high-quality queries in Darija, Modern Standard Arabic (MSA), English, and French, organized into 24 intent classes. We experimented with various intent classification methods, including full fine-tuning of monolingual and multilingual models, zero-shot learning, retrieval-based approaches, and Large Language Model prompting. Furthermore, we propose BERTouch, a BERT-based language model fine-tuned for intent detection in Darija, which outperforms state-of-the-art models, including OpenAI’s GPT-4, achieving F1 scores of 0.98 on Darija and 0.96 on MSA. The results provide insights into enhancing Moroccan Darija banking intent detection systems, highlighting the value of domain-specific data annotation and balancing precision and cost-effectiveness.

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

The field of Natural Language Processing (NLP) has gained significant traction, particularly with the emergence and advancement of Large Language Models (LLMs). This surge in interest and implementation is especially notable in industries where customer engagement and relationships are critical. In the retail banking domain, Financial NLP and LLMs are reshaping the dynamics of client interactions, fostering enhanced personalization, responsiveness, and ultimately, customer loyalty and satisfaction (Xu et al. Reference Xu, Shieh, van Esch and Ling2020).

In the ecosystem of generative AI-enhanced customer service, powerful LLMs such as GPT-4 (OpenAI 2024a) act as the brain of the system, orchestrating various components to deliver nuanced and contextually relevant interactions (Xi et al. Reference Xi, Chen, Guo, He, Ding, Hong, Zhang, Wang, Jin, Zhou, Zheng, Fan, Wang, Xiong, Zhou, Wang, Jiang, Zou, Liu, Yin, Dou, Weng, Cheng, Zhang, Qin, Zheng, Qiu, Huang and Gui2023). Among these components, retrieval-augmented generation (RAG) (Lewis et al. Reference Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, tau Yih, Rocktäschel, Riedel and Kiela2020) stands out by augmenting the agents’ responses with information retrieved from a dense knowledge base, thereby enriching the quality and relevance of the interactions. However, the linchpin in this setup is the Intent Classification Module, a critical aspect of Financial NLP’s Natural Language Understanding.

Intent classification, also known as Intent Detection, focuses on deciphering the semantic essence of user inputs to elicit the most appropriate response. It involves creating a linkage between a user’s request and the corresponding actions initiated by the chatbot, as outlined by Adamopoulou and Moussiades (Reference Adamopoulou and Moussiades2020). Typically framed as a classification challenge, this process entails associating each user utterance with one or, in certain cases, multiple intended actions. The task of intent classification presents notable difficulties. Conversations with chatbots often involve terse utterances that offer minimal contextual clues for accurate intent prediction. Furthermore, the extensive variety of potential intents necessitates a vast dataset for annotation, complicating the detection process due to the expansive label space that must be managed.

In this paper, we introduce DarijaBanking, a comprehensive Darija intent dataset, and embark on a systematic comparison of diverse intent classification methodologies, encompassing full fine-tuning of both monolingual and multilingual models, zero-shot learning, retrieval-based strategies, and Large Language Model (LLM) prompting. DarijaBanking is meticulously curated from three foundational English banking datasets, namely:

  • Banking77 dataset (Casanueva et al. Reference Casanueva, Temcinas, Gerz, Handerson and Vulic2020a), which provides a corpus of 13083 banking-related queries, each mapped to one of 77 distinct intents,

  • Banking-Faq-Bot dataset (Patel 2017), which includes 1764 questions distributed across 7 intent categories, and

  • Smart-Banking-Chatbot dataset (Lakkad Reference Lakkad2018), encompassing 30100 questions sorted into 41 intents, cumulatively aggregating to a rich repository of 42000 questions.

The initial phase involved a cleaning process aimed at intent unification across these diverse sources, addressing various challenges such as generic or colloquially dense sentences, redundancy, and contextual incompatibilities with the Moroccan banking sector. This refinement resulted in a distilled collection of 1800 English sentences, subsequently translated into French and Modern Standard Arabic (MSA) using the OPUS MT (Jörg and Santhosh Reference Jörg and Santhosh2020) and Turjuman (Nagoudi, Elmadany, and Abdul-Mageed Reference Nagoudi, Elmadany and Abdul-Mageed2022) models, respectively. A further crucial step leveraged GPT-4 (OpenAI 2024a) to translate the English sentences into Darija.

To ensure a reliable level of accuracy and contextual relevance, particularly for the Darija translations, we employed a team of five external human native speakers as annotators. The annotators performed detailed manual checks and corrections, focusing on the nuanced aspects of the Moroccan dialect that automated systems might overlook. Their work was vital in improving the dataset’s usefulness for our intended applications by precisely reflecting the linguistic nuances of Darija. It is noteworthy that all utterances underwent manual review, with approximately 47% of them edited to further ensure their accuracy and idiomatic appropriateness.

The resultant dataset features a total of 7200 queries across English, French, MSA, and Darija and comprises 1800 queries for each language. The MSA and Darija subsets serve as the foundation for assessing various intent classification methods, including full fine-tuning of monolingual and multilingual models, zero-shot learning, retrieval-based approaches, and Large Language Models prompting.

Thus, the main contributions of our paper can be summarized as follows:

  • The introduction of the DarijaBanking dataset, a novel resource for banking intent detection in Darija, featuring over 7200 multilingual queries across English, French, MSA, and Darija, organized into 24 intent categories. DarijaBanking is made publicly available (Skiredj et al. Reference Skiredj, Azhari, Berrada and Ezzini2024b).

  • A comparative analysis of intent classification methods, including monolingual and multilingual fine-tuning, zero-shot learning, retrieval strategies, and LLM prompting, is presented. Additionally, an in-depth error analysis examines intent classification difficulty and the influence of linguistic divergence between Darija and MSA.

  • The introduction of BERTouch, our Darija-specific BERT-based language model tailored for the intent classification task. BERTouch has been shown to be competitive with state-of-the-art solutions, including large language models such as GPT-4. BERTouch is made publicly available on the HuggingFace platform (Skiredj et al. Reference Skiredj, Azhari, Berrada and Ezzini2024a).

  • An insightful discussion on enhancing Moroccan Darija banking intent detection systems. We discuss the tradeoff between the use of generalist LLMs and cross-lingual transfer learning, highlight the importance of domain-specific data annotation, and examine the tradeoff between the precision of specialized classifiers and the cost-effectiveness of retrieval-based methodologies.

The remainder of the paper is structured as follows. Section 2 explores the related work, establishing a contextual backdrop. Section 3 provides a brief overview of Moroccan Darija, highlighting its unique grammatical, syntactic, and lexical differences from MSA. Section 4 introduces the DarijaBanking corpus, detailing the data creation processes. Section 5 explores the array of intent classification approaches, including model architectures and training paradigms. Section 6 presents the empirical results derived from these methodologies along with an in-depth Error Analysis subsection. In Section 7, we analyze the experimental results and address the limitations encountered in our study, leading to the conclusion of the paper.

2. Related work

Intent detection is pivotal in developing conversational AI systems, particularly in specialized domains such as banking, where precision and efficiency are paramount. Traditional methods often rely on full fine-tuning of large pre-trained language models such as BERT. However, Casanueva et al. (Reference Casanueva, Temcinas, Gerz, Henderson and Vulic2020b) demonstrated that full fine-tuning may not be necessary. Instead, they proposed a more efficient feature-based approach using fixed universal sentence encoders such as USE (Cer et al. Reference Cer, Yang, Kong, Hua, Limtiaco, St John, Constant, Guajardo-Cespedes, Yuan, Tar, Sung, Strope and Kurzweil2018) or ConveRT (Henderson et al. Reference Henderson, Casanueva, Mrkšić, Su, Wen and Vulić2020). In this method, utterances are encoded using these pre-trained “off-the-shelf” models, and a lightweight classifier, typically a multi-layer perceptron, is trained on top of the fixed embeddings. This approach maintains performance comparable to that of fully fine-tuned models while significantly reducing computational resources and training time. The tradeoff involves sacrificing some adaptability to domain-specific complexities that fine-tuning might capture, but the gains in efficiency make it practical for deployment in real-world applications.

Multilingual intent detection introduces additional complexities due to language diversity and resource limitations. Gerz et al. (Reference Gerz, Su, Kusztos, Mondal, Lis, Singhal, Mrksic, Wen and Vulic2021) addressed these challenges using multilingual sentence encoders such as the Multilingual Universal Sentence Encoder (mUSE) (Chidambaram et al. Reference Chidambaram, Yang, Cer, Yuan, Sung, Strope and Kurzweil2019) and Language-agnostic BERT Sentence Embedding (LaBSE) (Feng et al. Reference Feng, Yang, Cer, Arivazhagan and Wang2020). The authors evaluated these models in various training scenarios, including zero-shot and few-shot learning, on the MINDS-14 dataset (Gerz et al. Reference Gerz, Su, Kusztos, Mondal, Lis, Singhal, Mrksic, Wen and Vulic2021) covering multiple languages in the e-banking domain. Their findings indicate that even minimal domain-specific training data can significantly enhance performance over zero-shot models. However, the tradeoff lies in the dependence on the quality and availability of multilingual resources and the potential impact of errors from automatic speech recognition systems when dealing with spoken data.

In scenarios with limited data and semantically similar intents, few-shot learning techniques become essential. Zhang et al. (2021) proposed a method combining contrastive pre-training and fine-tuning (CPFT) to improve few-shot intent detection. Their approach involves a self-supervised contrastive pre-training phase that learns discriminative representations without labels, followed by a supervised fine-tuning phase that incorporates both intent classification and contrastive learning losses. This method effectively distinguishes between semantically similar intents, achieving state-of-the-art performance on challenging datasets under few-shot settings. The tradeoff with CPFT is the increased complexity and potential sensitivity to hyperparameters, which may require careful tuning, especially when data is scarce.

Despite these advancements, the robustness of pre-trained Transformer models in handling out-of-scope (OOS) intents remains a concern, particularly when the OOS intents are semantically similar to in-scope intents. Zhang et al. (Reference Zhang, Hashimoto, Wan, Liu, Liu, Xiong and Yu2022) investigated this issue by evaluating models such as BERT, RoBERTa, ALBERT, and ELECTRA on their ability to detect out-of-domain OOS (OODOOS) and in-domain OOS (IDOOS) intents. They found that these models struggle significantly with IDOOS detection, often assigning high confidence to misclassified IDOOS examples due to semantic overlap with in-scope intents. This limitation highlights the tradeoff between model complexity and the ability to generalize to unseen intents, indicating that confidence-based methods may be insufficient for effective OOS detection. There is a need for improved techniques that can better differentiate between semantically similar intents without overfitting to the in-scope data.

LLMs have shown promise in text classification tasks, but their deployment in resource-limited settings poses challenges due to computational costs. Loukas et al. (Reference Loukas, Stogiannidis, Diamantopoulos, Malakasiotis and Vassos2023) explored the effectiveness and cost-efficiency of LLMs for text classification in banking under such constraints. They evaluated methods including fine-tuning masked language models, contrastive learning with SetFit, in-context learning with LLMs like GPT-3.5 and GPT-4, and a cost-effective RAG technique. Their findings indicate that in-context learning with LLMs, particularly GPT-4, outperforms fine-tuned models in few-shot settings, achieving high accuracy even with fewer examples. However, the high operational costs of LLMs present a tradeoff. To mitigate this, they proposed using RAG, which significantly reduces costs while maintaining competitive performance by retrieving a small subset of relevant examples during inference. This approach balances the tradeoff between model performance and operational cost, emphasizing the need for efficient deployment strategies in practical applications.

Arabic poses special difficulties for NLP, mainly because of its wide variety of dialects and the scarcity of available domain-specific labeled datasets. This scarcity poses a barrier to advancing Arabic NLP applications in specialized domains such as intent detection and conversational systems. Recent studies by Darwish et al. (Reference Darwish, Habash, Abbas, Al-Khalifa, Al-Natsheh, Bouamor, Bouzoubaa, Cavalli-Sforza, El-Beltagy, El-Hajj, Jarrar and Mubarak2021) and Karajah et al. (Reference Karajah, Arman and Jarrar2021) underscore the lack of labeled datasets for dialectal Arabic, particularly for specialized tasks. Furthermore, there is growing interest in Arabic and dialectal NLP specifically for financial applications, as demonstrated by the AraFinNLP 2024 Arabic Financial NLP Shared Task (Sanad et al. Reference Sanad, Mo, Saad, Mohammed, Mustafa, Sultan, Ismail and Houda2024). This task focuses on the crucial roles of Multi-dialect Intent Detection and Cross-dialect Translation in the banking sector, aiming to enhance financial communication across various Arabic dialects in response to the dynamic growth of Middle Eastern stock markets.

Despite these challenges, some strides have been made towards understanding and processing the Arabic language more effectively. An initial step forward in the domain of Arabic intent detection was taken by Mezzi et al. (Reference Mezzi, Yahyaoui, Krir, Boulila and Koubaa2022), who introduced an intent detection framework tailored for the mental health domain in Tunisian Arabic. Their innovative approach involved simulating psychiatric interviews using a 3D avatar as the interviewer, with the conversations transcribed from audio to text for processing. The application of a BERT encoder and binary classifiers for each mental health aspect under investigation (depression, suicide, panic disorder, social phobia, and adjustment disorder) yielded noteworthy results, achieving an F1 score of 0.94.

Exploring other facets of Arabic NLP, Hijjawi et al. (Reference Hijjawi, Bandar and Crockett2013) ventured into the classification of questions and statements within chatbot interactions, leveraging decision trees for this task. Their methodology was later integrated into ArabChat (Hijjawi et al. Reference Hijjawi, Bandar, Crockett and Mclean2014), enhancing the system’s ability to preprocess and understand user inputs. Moreover, Joukhadar et al. (Reference Joukhadar, Saghergy, Kweider and Ghneim2019) contributed to the field by creating a Levantine Arabic corpus, annotated with various communicative acts. Their experiments with different classifiers revealed that the Support Vector Machine (SVM), particularly when utilizing 2-gram features, was most effective, achieving an accuracy rate of 0.86.

The quest for understanding Arabic speech acts and sentiments led Elmadany et al. (Reference Elmadany, Mubarak and Magdy2018) to develop the ArSAS dataset, encompassing around 21K tweets labeled for speech-act recognition and sentiment analysis. The dataset, marked by its categorization into expressions, assertions, questions, and sentiments (negative, positive, neutral, mixed), provided a fertile ground for subsequent studies. Utilizing ArSAS, Algotiml et al. (Reference Algotiml, Elmadany and Magdy2019) employed Bidirectional Long-Short Term Memory (BiLSTM) and SVM to model these nuanced linguistic features, achieving an accuracy of 0.875 and a macro F1 score of 0.615. Lastly, Zhou et al. (Reference Zhou, Liu and Qiu2022) demonstrated the potential of contrastive-based learning to enhance model performance on out-of-domain data, testing their methodology across several datasets, including the banking domain (Casanueva et al. Reference Casanueva, Temcinas, Gerz, Handerson and Vulic2020a), and showed that it is possible to improve adaptability without sacrificing in-domain accuracy.

Research on intent detection has progressed beyond Arabic, with notable studies in languages like Urdu and Indonesian. Shams et al. (Reference Shams, Aslam and Martinez-Enriquez2019) translated key datasets to Urdu and employed CNNs, LSTMs, and BiLSTMs, finding CNNs most effective for ATIS with a 0.924 accuracy and BiLSTMs best for AOL at 0.831 accuracy. This work was further refined to achieve a 0.9112 accuracy (Shams and Aslam Reference Shams and Aslam2022). Similarly, in Indonesian, Bilah et al. (Reference Bilah, Adji and Setiawan2022) utilized ATIS to inform their CNN model, achieving a 0.9584 accuracy.

Expanding the scope, Basu et al. (Reference Basu, Sharaf, Ip Kiun Chong, Fischer, Rohra, Amoake, El–Hammamy, Nosakhare, Ramani and Han2022) explored a meta-learning approach with contrastive learning on Snips and ATIS datasets for diverse domains, emphasizing the complexity of intent detection across different contexts.

Addressing the gap in Arabic intent detection, Jarrar et al. (Reference Jarrar, Birim, Khalilia, Erden and Ghanem2023) introduced ArBanking77, an Arabized dataset from Banking77 (Casanueva et al. Reference Casanueva, Temcinas, Gerz, Handerson and Vulic2020a), enhanced with MSA and Palestinian dialect queries, totaling 31404 queries across 77 intents. A BERT-based model fine-tuned on this dataset achieved F1 scores of 0.9209 for MSA and 0.8995 for the Palestinian dialect.

In the exploration of NLP for Moroccan Darija, significant strides have been made across two pivotal areas: the development of novel datasets and the advancement in classification tasks. The creation of specialized datasets, as seen in the work of Essefar et al. (Reference Essefar, Baha, Mahdaouy, el Mekki and Berrada2023), introduces the Offensive Moroccan Comments Dataset, marking a significant step towards understanding and moderating offensive language in Moroccan Arabic. This initiative fills a gap in resources tailored to the nuances of Moroccan dialect. Similarly, Boujou (Reference Boujou2021)’s contribution of a multi-topic and multi-dialect dataset from Twitter, which includes Moroccan Darija, provides a versatile tool for sentiment analysis, topic classification, and dialect identification, underlining the importance of accessible data for diverse NLP applications. On the classification front, El Mekki et al. (Reference El Mekki, El Mahdaouy, Essefar, El Mamoun, Berrada and Khoumsi2021)’s research leverages a deep Multi-Task Learning model with a BERT encoder for nuanced language identification tasks. This model’s ability to differentiate between various Arabic dialects and MSA showcases the evolving precision in language processing techniques. Furthermore, El Mekki et al. (Reference El Mekki, El Mahdaouy, Berrada and Khoumsi2022) introduce AdaSL, an unsupervised domain adaptation framework aimed at enhancing sequence labeling tasks such as Named Entity Recognition and Part-of-Speech tagging. Their approach, which capitalizes on unlabeled data and pre-trained models, addresses the challenges posed by the scarcity of labeled datasets in Dialectal Arabic, thereby advancing the field of token and sequence classification within Moroccan Darija and other dialects.

Building on these contributions, our study specifically targets the unexplored area of intent detection for Moroccan Darija within the banking domain. We introduce DarijaBanking, an innovative dataset crafted to enhance intent detection capabilities. This dataset, developed through the Arabization, localization, deduplication, thorough cleaning, and translation of existing English banking datasets, comprises over 3600 queries across 24 intent classes, in both Modern Standard Arabic and Moroccan Darija. In exploring various methods for intent detection, we highlight the success of a BERT-based model specifically fine-tuned for this dataset, which achieved notable F1 scores of 0.98 for Darija and 0.96 for MSA.

3. Background on Moroccan Darija

In this section, we provide a brief overview of Moroccan Darija, highlighting its distinctive grammatical, syntactic, and lexical features compared to MSA, with a focus on the financial domain.

More than 32 million people in Morocco and the Moroccan diaspora speak Moroccan Darija, a spoken dialect of Arabic (Benchiba–Savenius Reference Benchiba–Savenius2011). Darija is primarily used in informal contexts, such as everyday conversation and casual writing on social media, in contrast to MSA, which is used in formal settings like media, education, and official communication. Its growing presence on digital communication platforms and its extensive use highlight the need for dedicated computational resources and tools to handle the particular problems it poses (Laoudi et al. Reference Laoudi, Bonial, Donatelli, Tratz and Voss2018).

Moroccan Darija is not a single, uniform dialect; it combines regional varieties spoken in Fes, Chaouen, Oujda, the Moroccan Sahara, and elsewhere. Its sources are the Arabic dialects brought to the region over the centuries: the non-Bedouin dialects of the first Arab conquerors in the 7th and 8th centuries, the Bedouin dialects of nomadic tribes such as Beni Hilal and Beni Salim in the 11th century, and the Andalusian dialects of refugees from Spain in the 13th century. These dialects have since evolved through contact with Berber languages and other foreign languages such as French and Spanish, especially after Morocco’s colonial era (Benchiba–Savenius Reference Benchiba–Savenius2011).

Moroccan Darija is distinguished by its linguistic diversity and lack of standardization. It incorporates a wide range of multi-word expressions, idioms, and syntactic constructions, often displaying semi- or non-compositional meanings that differ substantially from those in MSA. This variability poses significant challenges for NLP tasks, such as machine translation, due to the informal nature of the dialect and its appearance in various scripts (Arabic and Latin) (Laoudi et al. Reference Laoudi, Bonial, Donatelli, Tratz and Voss2018). More precisely, Moroccan Darija differs from MSA in several critical areas, including but not limited to:

3.1 Grammar and syntax

Darija has verb forms and conjugations that differ from those of MSA,Footnote a and hence its grammatical structures differ as well. For instance, Darija often uses the prefix (n-) for first-person singular verbs, unlike MSA, where the conjugation typically starts with the prefix (a-) (e.g., “I do” in MSA vs. “I do” in Darija). The dialect also employs distinctive particles to form yes/no questions and uses a double negation structure (e.g., “I didn’t understand you”), whereas MSA would use a single particle (e.g., ).

Darija often simplifies sentence structures. For example:

  • MSA: (What should I do?)

  • Darija: (What do I do?)

Conditional expressions also show unique features:

  • Darija: (If I want) compared to MSA:

  • Darija: (I won’t leave my account in your bank even if you give me gold) compared to MSA:

Darija often omits particles that are mandatory in MSA:

  • MSA: (Can I cancel…?) compared to Darija:

Darija uses a different pronoun for “it” when it is attached directly to verbs:

  • MSA: (I ordered it, I opened it) compared to Darija:

Darija also uses the particle (one) to indicate indefiniteness:

  • MSA: (a boy), (a girl)

  • Darija: (a boy), (a girl)

3.2 Vocabulary and loanwords

Moroccan Darija is more linguistically liberal than MSA, incorporating a large number of loanwords and calques from languages like French, Spanish, and Berber. For instance:

  • MSA: (card) compared to Darija: —from French “carte”

  • MSA: (wallet) compared to Darija: —from Spanish “buzón”

  • MSA: (car) compared to Darija: —from French “automobile”

  • MSA: (loan) compared to Darija: —from French “crédit”

  • MSA: (account) compared to Darija: —from French “compte”

Additionally, Darija regularly uses colloquial language and slang that are not present in MSA:

  • MSA: (This is very wonderful) compared to Darija:

3.3 Idiomatic expressions and colloquial terms

There are many idioms and colloquial words in Darija that are absent from MSA. These idioms frequently have contextual and cultural connotations that make them difficult to translate accurately into other languages. For example:

  • Darija: (May God give you health)—expresses deep gratitude, whereas MSA would use (thanks).

  • Darija: (Please) versus MSA:

  • Darija: (Thanks; literally: pride) versus MSA:

  • Darija: (Thank you; literally: May God have mercy on your parents)

3.4 Pronouns and question words

Darija utilizes different pronouns and question words compared to MSA:

  • MSA: (What is the best way…?) compared to Darija:

  • MSA: (How can I…?) compared to Darija: MSA: (Where can I…?) compared to Darija:

  • MSA: (When will the card arrive?) compared to Darija:

  • MSA: (Why?) compared to Darija:

Darija uses both independent and dependent personal pronouns. Here are the ones that differ from MSA:

3.5 Independent pronouns ()

  • You, masculine:

  • You, feminine:

  • We:

  • You all:

  • They:

3.6 Dependent pronouns ()

  • Him: or

3.7 Verb conjugation and tense structures

Darija displays unique conjugation patterns and tense markers. The future tense often uses :

  • MSA: (I will go) compared to Darija:

The present continuous form is expressed using :

  • MSA: (I am using the card) compared to Darija:

3.8 Prepositions and possessive structures

Darija often uses different prepositions and possessive markers:

  • MSA: (in my account) compared to Darija:

  • MSA: (to there) compared to Darija:

Darija uses to denote possession:

  • MSA: (my card) compared to Darija:

3.9 Relative clauses

Relative clauses in Darija are typically introduced using :

  • Darija: (The person who gave me the card)

Darija’s distinct linguistic characteristics, which include variations in phonology, morphology, grammar, syntax, lexicon, and orthography, underline the need for dedicated linguistic and computational resources to enable NLP tasks such as intent detection. The mismatch between resources designed for MSA and the realities of Darija calls for concerted efforts to overcome these obstacles.

4. The DarijaBanking corpus

The DarijaBanking CorpusFootnote b represents a new effort to tailor and enrich banking-related linguistic resources specifically for the Moroccan context, leveraging the foundational structures of three significant English datasets, namely (1) Banking77Footnote c (Casanueva et al. Reference Casanueva, Temcinas, Gerz, Handerson and Vulic2020a), which encompasses an extensive collection of 13083 banking-related queries, each meticulously categorized into one of 77 distinct intents; (2) the banking-faq-bot datasetFootnote d (Patel 2017), comprising 1764 questions distributed across seven intent categories; and (3) the smart-banking-chatbot datasetFootnote e (Lakkad Reference Lakkad2018), that includes a broad spectrum of 30100 questions sorted into 41 intents. Collectively, these resources amalgamate into a comprehensive repository of 42000 questions. Subsequently, we will detail the various stages involved in our corpus’s data collection and validation process.

4.1 Data collection

The data collection process comprises four phases:

4.1.1 Phase I: Cleaning

The first step in developing the DarijaBanking corpus was a rigorous cleaning process tailored to align the dataset with the nuances of Morocco’s banking sector. This essential phase focused on eliminating queries and intents associated with banking practices and products common in countries like the US or UK but irrelevant or nonexistent in Morocco. We aimed to exclude references to unfamiliar banking services within the Moroccan banking environment, such as specific types of loans, investment opportunities, or account functionalities unavailable in local banks. For instance, intents related to “Apple Pay or Google Pay,” “Automatic Top-Up,” “Disposable Card Limits,” “Exchange Via App,” “Get Disposable Virtual Card,” “Topping Up by Card,” and “Virtual Card Not Working” were removed due to their limited relevance to Moroccan banking users. This is because the penetration of digital wallet services such as Apple Pay and Google Pay is not as extensive in Morocco, making these services less applicable. Additionally, the concept of automatic top-ups, disposable virtual cards, and app-based foreign exchange are indicative of fintech advancements not fully adopted or supported by Moroccan banking institutions, which tend to offer more traditional services. The use of virtual cards and card-based account top-ups, while increasing worldwide, might not yet align with common banking practices or the digital infrastructure in Morocco.

The cleaning also involved a critical step of utterance-level filtering to bolster the corpus’s relevance to the Moroccan banking context, by eliminating references to:

  • Transactions linked to UK bank accounts, for example “I made a transfer from my UK bank account.”

  • Queries on services for non-residents, for example “Can I get a card if I’m not in the UK?”

  • The use of foreign currencies and credit cards, for example “I need Australian dollars,” “Can I use a US credit card to top-up?”

  • International transfers and related issues, for example “How long does a transfer from the UK take?” “My transfer from the UK isn’t showing.”

Removing utterances involving international banking contexts, uncommon currency conversions in Morocco, or services related to foreign accounts made the corpus more reflective of Moroccan banking scenarios.

A subsequent challenge was addressing ambiguous intent clusters that hindered clear intent detection. Clusters such as “card_not_working” and “compromised_card” exemplified the issue, with the former combining intents like “activate_my_card” and “declined_card_payment,” and the latter grouping “compromised_card” with “lost_or_stolen_card.” Similarly, “transfer_problems” and “identity_verification” clusters highlighted the difficulty in distinguishing between closely related intents, such as “failed_transfer” and “transfer_not_received_by_recipient,” or “unable_to_verify_identity” and “verify_my_identity,” respectively. These clusters revealed substantial overlap and frequent misclassification of intents, not just by models but also by human annotators, complicating the dataset’s utility for accurate intent recognition. For example, an utterance like “I might need a new card because it’s not working at any of the ATMs” could be interpreted as both “card not working” and “declined_cash_withdrawal.” Such ambiguities led to inconsistent classifications, making it difficult to fairly assess model performance. To address these inherent problems, we refined the dataset by merging similar intents and eliminating those that were prone to ambiguity and misclassification errors. This consolidation was not intended to make the classification task easier but to provide a clearer and more consistent framework for evaluating intent detection. By focusing on broader, more distinguishable intent categories, we aimed to enhance the dataset’s reliability and ensure a fairer evaluation of model performance in the context of Moroccan banking.
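To make the consolidation step concrete, the following minimal sketch shows how such an intent merge could be applied to a labeled table of utterances. The merge map is purely illustrative and reproduces only some of the merges described above, not the exact mapping used for DarijaBanking.

```python
import pandas as pd

# Illustrative merge map: fine-grained, easily confused intents -> broader categories.
# This sketches the mechanism only; the exact DarijaBanking mapping may differ.
INTENT_MERGE_MAP = {
    "failed_transfer": "transfer_problems",
    "transfer_not_received_by_recipient": "transfer_problems",
    "unable_to_verify_identity": "identity_verification",
    "verify_my_identity": "identity_verification",
    "compromised_card": "lost_or_stolen_card",
}

def consolidate_intents(df: pd.DataFrame, label_col: str = "intent") -> pd.DataFrame:
    """Replace ambiguous intent labels with their merged, broader category."""
    out = df.copy()
    out[label_col] = out[label_col].replace(INTENT_MERGE_MAP)
    return out
```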

Furthermore, deduplication was a key step in refining the DarijaBanking corpus, targeting nearly identical queries with slight wording differences but the same intent, such as inquiries about SWIFT transfers or using Apple Pay. This step improved the dataset’s precision, aiding the intent detection model in more accurately classifying customer intents in Morocco’s banking sector.

Through detailed refinement, including deduplication and correction of incorrect utterance-intent associations, we developed a polished collection of 1660 English sentences, significantly enhancing the dataset’s utility for accurate intent recognition within the Moroccan banking context.
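A minimal sketch of the deduplication step is given below. It removes queries that are identical after light text normalization; near-duplicates with only slight wording differences could additionally be filtered by thresholding the cosine similarity of sentence embeddings. The column names are assumptions used for illustration.

```python
import re
import pandas as pd

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(df: pd.DataFrame, text_col: str = "utterance") -> pd.DataFrame:
    """Drop utterances that duplicate an earlier one after normalization."""
    keys = df[text_col].map(normalize)
    return df.loc[~keys.duplicated()].reset_index(drop=True)
```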

4.1.2 Phase II: The two additional intents IDOOS and OODOOS

In the subsequent phase, we expanded the DarijaBanking corpus by incorporating two additional intent categories: In-Domain Out-Of-Scope (IDOOS) and Out-Of-Domain Out-Of-Scope (OODOOS). IDOOS encompasses requests that, while not previously listed, remain pertinent to the banking sector, such as inquiries about Western Union facilities. By adding IDOOS, we ensure the chatbot can recognize and manage banking-related queries that fall outside the predefined set of intents. This expansion allows the bot to cater to a broader array of banking inquiries, enhancing its utility and user satisfaction by reducing instances where legitimate banking questions go unrecognized. Conversely, OODOOS covers requests that are entirely unrelated to banking, exemplified by questions regarding the distance between the Earth and the Moon. The inclusion of OODOOS helps the chatbot identify and appropriately handle queries unrelated to banking. This distinction is crucial for maintaining the chatbot’s focus on banking topics and directing users to suitable resources for their non-banking questions, thereby improving the overall user experience by preventing irrelevant responses. For each of these new intent categories, we integrated 70 English utterances generated by ChatGPT, enriching the corpus to encompass a total of 1800 English sentences across 24 distinct intents. Examples of these intents include:

4.1.2.1 IDOOS examples

  • Are there any blackout dates or restrictions on redeeming my rewards?

  • Can you explain the different types of retirement accounts?

  • (How do I learn to invest in the stock market?)

  • (How do I use health savings account funds to invest?)

  • (How do I earn reward points with my card?)

4.1.2.2 OODOOS examples

  • (What is the average lifespan of a cat?)

  • (What’s the best way to learn a new language?)

  • (Who won last year’s World Cup?)

4.1.3 Phase III: Automatic translation

In this phase, the cleaned English sentences were translated into French, Arabic, and Darija using the OPUS MT (Jörg and Santhosh Reference Jörg and Santhosh2020), Turjuman (Nagoudi et al. Reference Nagoudi, Elmadany and Abdul-Mageed2022), and GPT-4 (OpenAI 2024a) models, respectively.
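As an illustration of this phase, the sketch below performs the English-to-French step with an OPUS MT checkpoint through the Hugging Face transformers pipeline; the MSA step with Turjuman and the Darija step with GPT-4 would be wired in analogously. The checkpoint name is an assumption and should be replaced with the exact models used.

```python
from transformers import pipeline

# English -> French with OPUS MT (checkpoint name assumed for illustration).
en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

def translate_to_french(sentences):
    return [out["translation_text"] for out in en_fr(sentences)]

print(translate_to_french(["How can I activate my new card?"]))

# The MSA step would call a Turjuman checkpoint in the same way, and the Darija step
# would send each English sentence to GPT-4 through the OpenAI chat completions API.
```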

4.1.4 Phase IV: Manual verification & correction

Here the emphasis was placed not only on linguistic precision but also on ensuring that the translations were contextually aligned with Moroccan banking terminology. The endeavor went beyond mere translation to include the crucial aspect of localization. For instance, the term “transfer” in a banking context was appropriately translated not as , which conveys a general sense of moving or transferring, but as , accurately capturing the notion of financial transfers. This understanding of banking terminology and its correct application is vital for the dataset’s relevance and utility. Additional examples include “savings account,” which required careful translation to to avoid a generic rendition, and “loan approval,” which was precisely translated to , eschewing a less appropriate literal translation. These examples show how important it is to take a nuanced approach to localization and translation in order to make sure the dataset is compatible with the nuances of Moroccan banking operations and language.

Table 1. Some examples of manually corrected translations from English to Darija

The process of manual correction and verification of the translations from English to Moroccan Darija was conducted by a team of five human labelers who are native speakers of Moroccan Darija. The 1800 sentences were divided equally among these five labelers, with each speaker independently reviewing and correcting their assigned sentences. One of the labelers, who is also an author of this work, cross-checked all the corrections made by the other four to ensure overall consistency. The level of disagreement was minimal and was resolved on the spot; hence, we do not report an exact inter-rater reliability score. Approximately 47% of the translated utterances were edited to enhance accuracy and idiomatic expression. These labelers reviewed the initial translations provided by GPT-4 and made necessary adjustments to ensure that the translations were accurate and idiomatic to Moroccan Darija speakers.

The types of corrections that were performed included standardizing terms, correcting verb conjugations, and ensuring the use of dialectally appropriate vocabulary. For example, the initial translation for “Will deactivating my card affect my automatic bill payments?” was which was corrected to by replacing the initial word with to better fit a question structure in Moroccan Darija. Some more examples are illustrated in Table 1. In broader terms, the corrections made sure that the translations were not only linguistically accurate but also culturally resonant, taking into account the nuances of Moroccan Darija and its syntactic and lexical particularities. This quality control process is crucial in ensuring that translations are not only understood but also feel natural to Moroccan Darija speakers.

Table 2. Comprehensive Intent Catalogue

4.2 Comprehensive intent catalogue

Table 2 delineates a comprehensive list of intents featured in the corpus, along with their corresponding definitions. This overview serves to illustrate the dataset’s scope concerning customer interactions and inquiries specific to the banking industry. In total, the corpus comprises 22 intents, plus the two additional IDOOS and OODOOS categories described above. The number of utterances per intent ranges from 58 to 99, with an average of 74, indicating a well-balanced distribution across the dataset (Table 3).

Table 3. Mean linguistic differences per intent between Darija and MSA, rated on a scale from 1 (very close) to 3 (far apart)

4.3 Data segmentation and descriptive analysis

The final dataset comprises 7200 queries distributed across English, French, Arabic, and Darija, of which 3600 queries are in Arabic and Darija, spanning 24 distinct intents. The dataset was divided into training and test sets following an 80:20 ratio; we opted not to designate a separate validation subset due to the manageable size of the dataset. This stratified split was designed to maintain the original proportion of utterances per intent, ensuring that these ratios were consistently mirrored in both the training and testing subsets. Table 4 provides detailed statistics on the DarijaBanking corpus, offering insights into the dataset’s composition and the distribution of utterances among the various intents.
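A minimal sketch of such a stratified 80:20 split, assuming a language subset held in a dataframe with "utterance" and "intent" columns (column names and random seed are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """80:20 split that preserves the per-intent utterance proportions in both subsets."""
    train_df, test_df = train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df["intent"]
    )
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)
```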

Table 4. Statistics of DarijaBanking dataset

4.4 Labeling Darija-MSA linguistic divergence in the test set

For the test set split, we aimed to explore how far Darija’s utterances diverge from their MSA counterparts. To achieve this, each utterance was manually labeled with a score of 1 (very close to MSA), 2, or 3 (far away from MSA). The goal of this annotation was to capture the degree of similarity between the Darija and MSA variants, offering a finer understanding of how distinct Darija utterances are from MSA.

Upon analyzing the data, the mean difference between Darija and MSA utterances by intent ranged from 1.53 to 2.18, reflecting consistency in annotation. This result indicates that while Darija in this context is not extremely distant from MSA, it is also not too close, providing a meaningful differentiation in linguistic features. The overall mean of these differences is 1.9, with a standard deviation of 0.16 and a median value of 1.87, which aligns closely with the mean, suggesting a stable distribution.
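These summary statistics can be computed directly from the annotated test set; the sketch below assumes a dataframe with an "intent" column and a manually assigned "divergence" score in {1, 2, 3} per utterance (the column names are assumptions).

```python
import pandas as pd

def divergence_summary(test_df: pd.DataFrame):
    """Per-intent mean divergence plus overall mean, standard deviation, and median."""
    per_intent = test_df.groupby("intent")["divergence"].mean()
    overall = test_df["divergence"].agg(["mean", "std", "median"])
    return per_intent, overall
```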

Table 3 illustrates the mean differences by intent, underscoring the consistency of the annotation process and offering insights into the variability of Darija’s divergence from MSA across different intents.

5. Intent detection approaches

In this section, we embark on a systematic comparison of diverse methodologies for intent detection, which can broadly be categorized into three distinct approaches: BERT-like models fine-tuning, retrieval-based strategies, and LLM prompting. The fine-tuning of BERT-like models Devlin et al. (Reference Devlin, Chang, Lee and Toutanova2019) includes both monolingual and multilingual variations, where we specifically focus on adapting monolingual Arabic models for Arabic and Moroccan Darija and employ cross-lingual transfer techniques with multilingual models like XLM-RoBERTa (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020) across English, French, Arabic, and Darija datasets. Furthermore, we investigate zero-shot learning scenarios, where the model’s performance is tested on Arabic and Moroccan Darija without prior exposure to these languages during training, to assess its capability to generalize across languages or dialects not included in the training data. Meanwhile, the retrieval-based strategy leverages text embedding models to index and match queries to the nearest utterance, thus inferring intent based on semantic similarity. Lastly, LLM prompting involves the utilization of advanced models such as ChatGPT-3.5 OpenAI (2022) and GPT-4 (OpenAI 2024a), which are prompted to classify intents by providing a comprehensive list of intents and their descriptions. Each of these approaches offers a unique perspective on intent detection, highlighting the versatility and adaptability of current technologies to understand and classify user intents across a range of languages and contexts.

5.1 BERT-like models fine-tuning

The BERT model, developed by Google in 2018, has revolutionized the field of NLP by introducing a transformer-based architecture that excels in a wide range of common language tasks, including sentiment analysis, named entity recognition, and question answering (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019). The core of BERT’s architecture is the transformer, utilizing an attention mechanism that efficiently learns contextual relationships between words in a given sequence. BERT uses only the transformer encoder: stacked self-attention layers process the input text, and task-specific output heads generate predictions for pre-training objectives such as masked token prediction and next-sentence prediction. This innovative approach has enabled BERT to achieve state-of-the-art performance across various NLP benchmarks. In the context of this paper, we fine-tune a pre-trained BERT model for intent detection. To adapt BERT for the intent classification task, a single linear layer is appended to the pre-existing transformer layers of the BERT model. This modification allows for the direct application of BERT’s contextual embeddings to the task of intent detection, leveraging its understanding of language nuances to achieve high accuracy.
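In practice, appending the linear classification layer amounts to loading the encoder with a sequence-classification head. The sketch below uses the Hugging Face transformers API; the checkpoint name is a placeholder for any of the Arabic or multilingual encoders discussed next.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "xlm-roberta-base"   # placeholder; swap in AraBERT, MARBERT, CAMeLBERT, ...
NUM_INTENTS = 24                  # 22 banking intents plus IDOOS and OODOOS

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# `num_labels` appends a single linear classification layer on top of the encoder,
# which is the adaptation described above; this head is then trained on DarijaBanking.
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=NUM_INTENTS)

enc = tokenizer(["How can I activate my new card?"],
                return_tensors="pt", padding=True, truncation=True)
logits = model(**enc).logits              # shape: (1, NUM_INTENTS)
predicted_intent_id = logits.argmax(-1).item()
```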

Given the linguistic diversity and complexity of the Arabic language and its dialects, the performance of multilingual pre-trained transformers, including BERT, often varies. Recognizing this challenge, researchers have developed several BERT-like models tailored to Arabic and its dialects. These models have been pre-trained on vast corpora of Arabic text, encompassing both MSA and various regional dialects, to capture the rich linguistic features unique to Arabic. Among these models, AraBERT (Antoun, Baly, and Hajj Reference Antoun, Baly and Hajj2020) stands out, having been trained on substantial Arabic datasets, including the 1.5 billion-word Abu El-Khair corpus and the 3.5 million-article OSIAN corpus. Similarly, ARBERT and MARBERT (Abdul-Mageed, Elmadany, and Nagoudi Reference Abdul-Mageed, Elmadany and Nagoudi2021), as well as MARBERTv2, have been trained on extensive collections of MSA and dialectical Arabic texts, with MARBERTv2 benefiting from even more data and longer training regimes. QARiB (Abdelali et al. Reference Abdelali, Hassan, Mubarak, Darwish and Samih2021) represents another significant contribution, developed by the Qatar Computing Research Institute and trained on a diverse mix of Arabic Gigaword, Abu El-Khair Corpus, and Open Subtitles. Lastly, CAMeLBERT-Mix (Inoue et al. Reference Inoue, Alhafni, Baimukan, Bouamor and Habash2021) incorporates a broad spectrum of MSA and dialectical Arabic sources, including the Gigaword Fifth Edition and the OpenITI corpus, to create a model that is well-suited for a wide range of Arabic NLP tasks. Building on the landscape of BERT-like language models, XLM-RoBERTa (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020) emerges as a pivotal development, extending the capabilities of language understanding beyond single-language models. XLM-RoBERTa is architected on the robust foundation of RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019), leveraging a Transformer-based framework to foster deep contextual understanding across languages. Trained on an expansive corpus encompassing texts from 100 different languages, XLM-RoBERTa utilizes a large-scale approach without the need for parallel corpora, focusing on the Masked Language Model objective to predict randomly masked tokens within a text for profound contextual comprehension. Additionally, it incorporates the Translation Language Model objective in specific training setups, where it learns to predict masked tokens in bilingual text pairs, further enhancing its cross-lingual capabilities. In this paper, we evaluate these Arabic pre-trained transformer models, alongside the multilingual XLM-Roberta (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020) on the DarijaBanking dataset.

5.2 Retrieval-based intent detection

Retrieval-based intent detection represents a pragmatic approach towards understanding and classifying user intents. By employing sophisticated text embedding models, each utterance within the training dataset is transformed into a dense vector representation. When a new query is received, it is also embedded into the same vector space. The intent of the query is inferred by identifying the nearest utterance in the training set, where “nearest” is quantified by the highest cosine similarity between the query’s embedding and those of the dataset utterances. This method hinges on the premise that semantically similar utterances, when properly embedded, will occupy proximate regions in the embedding space, facilitating an efficient and effective retrieval-based classification.
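A minimal sketch of this retrieval-based classifier is shown below, using a multilingual sentence encoder from the sentence-transformers library (discussed in the following subsections); the model ID and the toy training utterances are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Toy labeled utterances standing in for the DarijaBanking training split.
train_texts = ["I want to activate my new card",
               "What is the minimum age to open an account?"]
train_labels = ["activate_my_card", "age_limit"]

# Index: embed every training utterance once; normalized vectors make the dot
# product equal to cosine similarity.
index = encoder.encode(train_texts, normalize_embeddings=True)

def classify(query: str) -> str:
    q = encoder.encode([query], normalize_embeddings=True)
    nearest = int(np.argmax(index @ q.T))    # highest cosine similarity
    return train_labels[nearest]
```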

5.2.1 Neural architectures for text embedding models

Text embedding models, particularly those inspired by the BERT architecture, have been central to advancements in NLP. However, traditional BERT models do not directly compute sentence embeddings, which poses challenges for applications requiring fixed-length vector representations of text. A common workaround involves averaging the output vectors or utilizing the special CLS token’s output, though these methods often result in suboptimal sentence embeddings. To address these limitations, Sentence-BERT (Reimers and Gurevych Reference Reimers and Gurevych2019) was introduced as an adaptation of the BERT framework, incorporating Siamese and triplet network structures to generate semantically meaningful sentence embeddings. These embeddings can then be compared using cosine similarity, offering a significant boost in efficiency and effectiveness across various sentence-level tasks.

5.2.2 Arabic text embedding models

Although models like AraBERT (Antoun et al. Reference Antoun, Baly and Hajj2020), ARBERT and MARBERT (Abdul-Mageed et al. Reference Abdul-Mageed, Elmadany and Nagoudi2021), MARBERTv2, QARiB (Abdelali et al. Reference Abdelali, Hassan, Mubarak, Darwish and Samih2021), and CAMeLBERT-Mix (Inoue et al. Reference Inoue, Alhafni, Baimukan, Bouamor and Habash2021) are not primarily designed for retrieval tasks, they can be adapted for such purposes by using the embedding of the CLS token. This approach allows these pre-trained language models to be leveraged for retrieval-based intent detection by mapping queries and dataset utterances into the same embedding space and comparing them using similarity metrics.
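A sketch of this CLS-based adaptation is given below, assuming a generic encoder checkpoint; for checkpoints without a literal [CLS] token, such as XLM-RoBERTa, the first token plays the same role.

```python
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "xlm-roberta-base"   # placeholder for any of the Arabic encoders above
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

@torch.no_grad()
def cls_embedding(text: str) -> torch.Tensor:
    """Use the final-layer embedding of the first ([CLS]-like) token as a sentence vector."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    return model(**enc).last_hidden_state[:, 0, :]   # shape: (1, hidden_size)
```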

5.2.3 Multilingual text embedding models supporting Arabic

Several models have emerged to support multilingual and dialect-specific embeddings, crucial for applications involving Arabic and its dialects:

  • Sentence Transformers and Knowledge Distillation: The approach taken by models such as the one described by (Reimers and Gurevych Reference Reimers and Gurevych2020a) combines the strength of Sentence Transformers with the concept of knowledge distillation. Here, a monolingual “teacher” model guides a “student” model to produce comparable sentence embeddings across languages, facilitated by training on parallel corpora. Within this family of models, several have been specifically considered for their ability to handle Arabic. These include distiluse-base-multilingual-cased-v1 (Reimers and Gurevych Reference Reimers and Gurevych2020b) which operates within a 512-dimensional dense vector space; paraphrase-multilingual-mpnet-base-v2 (Reimers and Gurevych Reference Reimers and Gurevych2020d), offering embeddings in a 768-dimensional space; and paraphrase-multilingual-MiniLM-L12-v2 (Reimers and Gurevych Reference Reimers and Gurevych2020c), which provides embeddings in a 384-dimensional dense vector space.

  • LaBSE (Language-Agnostic BERT Sentence Embedding): Google’s LaBSE model (Feng et al. Reference Feng, Yang, Cer, Arivazhagan and Wang2020) represents a significant leap forward, blending techniques for monolingual and cross-lingual embedding generation. By integrating masked language modeling, translation language modeling, and other advanced methods, LaBSE achieves robust multilingual and dialectal coverage. LaBSE offers embeddings in a 768-dimensional space.

  • LASER (Language-Agnostic SEntence Representations): Developed by Meta, LASER (Artetxe and Schwenk Reference Artetxe and Schwenk2019) employs a BiLSTM encoder with a shared BPE vocabulary across languages. This model excels in generating language-agnostic sentence embeddings, supported by extensive parallel corpora training. LASER offers embeddings in a 1024-dimensional space.

  • E5 Model: The E5 model (Wang et al. Reference Wang, Yang, Huang, Jiao, Yang, Jiang, Majumder and Wei2022) introduces a contrastive training approach with weak supervision, yielding embeddings that excel in retrieval, clustering, and classification tasks across languages, including zero-shot and fine-tuned scenarios. Within this family of models, the base model with its 12 layers was used. It provides embeddings in a 768-dimensional space.

  • OpenAI’s text-embedding-3-large (OpenAI 2024b): This model represents OpenAI’s latest advancement in text embedding technology. It’s designed for high performance and flexibility, offering embeddings with up to 3072 dimensions. This model is particularly notable for its enhanced multilingual performance, making it a valuable tool for tasks requiring a nuanced understanding of text across various languages.

Despite the advancements in text embedding models and their application to a multitude of languages, it is important to acknowledge a significant limitation when it comes to handling Arabic and its diverse dialects. The aforementioned models, while multilingual and capable of supporting Arabic to a degree, are not specifically designed with Arabic and its dialects as a primary focus. Their training datasets predominantly encompass MSA, with limited or, in some cases, no exposure to the various dialects spoken across the Arab world. This gap underscores a critical challenge in the field of NLP: the development of embedding models that can accurately capture the nuances and variations inherent in dialectal Arabic.

5.3 Intent detection by LLM prompting

This approach leverages the capabilities of LLMs such as ChatGPT-3.5 (OpenAI 2022), GPT-4 (OpenAI 2024a), and JAIS-13B Chat (Sengupta et al. Reference Sengupta, Sahu, Jia, Katipomu, Li, Koto, Marshall, Gosal, Liu, Chen, Afzal, Kamboj, Pandit, Pal, Pradhan, Mujahid, Baali, Han, Bsharat, Aji, Shen, Liu, Vassilieva, Hestness, Hock, Feldman, Lee, Jackson, Ren, Nakov, Baldwin and Xing2023) to classify customer intents through a strategic prompting methodology. These models were chosen because they are among the most capable and proven efficient LLMs in Arabic at the time of the writing of this work. Specifically, JAIS-13B was selected over the larger JAIS 30B due to limitations in compute power. This method provides the LLM with detailed context, its role as a classifier, and an extensive list of 24 predefined intents alongside their descriptions. The prompt is structured as follows:

[[ Context: You are an advanced banking chatbot designed for a Moroccan bank, equipped to assist customers with a range of inquiries and services related to banking. Your capabilities extend from handling basic account management to addressing complex service requests. Your primary objective is to accurately discern the customer’s intent from their utterances, using the list of predefined intents to provide relevant assistance or guide them to the appropriate service channel.

Here is the list of all intents and their meanings:

  • activate_my_card: Initiate the use of a new banking card.

  • age_limit: Inquire about the minimum age requirement for a service.

  • cancel_order: Request to cancel a previously placed order. […]

  • oodoos: An intent not in the list of intents and not related to banking, like asking the distance between the Earth and the Moon.

When you receive the 5 utterances from a customer, analyze the content to determine the most applicable intents. Consider the context of banking practices in Morocco, including services and customer expectations.

Instructions: 1. Read the customer’s utterances carefully. 2. Identify the most relevant intent for each utterance from the predefined list. 3. Return the detected intents in JSON format for easy parsing: ```{"intents": ["intent1", "intent2", "intent3", "intent4", "intent5"]}```

Make sure to return only one intent for each utterance. Select the intent that best matches the customer’s query or service need for each of the five utterances. If an utterance does not fit any predefined intents or falls outside the banking domain, use “oodoos” for unrelated queries and “idoos” for banking-related queries not listed among the predefined intents.]]

Given the complexity and the need for efficiency, multiple utterances were classified within the same prompt rather than individually, optimizing both cost and computational resources. For JAIS-13B Chat, however, the prompt was simplified to include only one utterance at a time instead of five, as the model failed to handle multiple utterances reliably. Additionally, we experimented with both the English prompt and its Arabic translation (while keeping the intent label names in English) to evaluate any performance differences. The adaptation also involved adjusting the formatting to suit the specifics of the JAIS prompting schema.
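As an illustration of this setup, the sketch below sends a batch of utterances to a chat-completion LLM with the classifier prompt above (abbreviated here) and parses the JSON reply. The model name and decoding parameters are illustrative assumptions, and the calls follow the OpenAI Python SDK rather than the exact experimental harness.

```python
# Batched intent classification by LLM prompting (OpenAI Python SDK); the model
# name, abbreviated system prompt, and decoding settings are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Context: You are an advanced banking chatbot designed for a Moroccan bank... "
    '(full classifier prompt as quoted above) ... Return the detected intents as '
    'JSON: {"intents": ["intent1", "intent2", "intent3", "intent4", "intent5"]}'
)

def classify_batch(utterances: list[str]) -> list[str]:
    """Send five utterances in one prompt and parse the JSON reply."""
    user_message = "\n".join(f"{i + 1}. {u}" for i, u in enumerate(utterances))
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(response.choices[0].message.content)["intents"]
```

In practice, malformed JSON replies require a retry or a more tolerant parser, which is one of the hidden costs of prompting-based classification.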

5.4 Intent detection through NMT pipeline

One effective strategy that balances intuition and competitiveness is to employ a neural machine translation (NMT) pipeline. This approach translates text from a low-resource language into a high-resource language before executing natural language processing tasks. Research by Boujou et al. (Reference Boujou, Chataoui, Mekki, Benjelloun, Chairi and Berrada2023) demonstrated that integrating a translation pipeline can yield high-quality NLP outcomes for low-resource languages. In our specific scenario, we translate Darija queries into English and subsequently apply English-based intent detection models.
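A minimal sketch of such a pipeline is given below, chaining a multilingual translation model with an English intent classifier. The SeamlessM4T text-to-text interface and the Darija language code are assumptions for illustration, and the classifier path is a placeholder for a model fine-tuned on the English split (the configuration actually evaluated is described in Section 6.5).

```python
# Translate-then-classify sketch: Darija -> English with SeamlessM4T, then an
# English intent classifier. Language codes and the classifier path are assumptions.
from transformers import AutoProcessor, SeamlessM4TForTextToText, pipeline

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
translator = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-large")
# Placeholder for a classifier fine-tuned on the English split of the dataset.
english_classifier = pipeline("text-classification", model="path/to/english-intent-bert")

def classify_darija(query: str) -> str:
    """Translate a Darija query to English, then run the English classifier."""
    inputs = processor(text=query, src_lang="ary", return_tensors="pt")  # "ary" = Moroccan Arabic (assumed code)
    tokens = translator.generate(**inputs, tgt_lang="eng")
    english_query = processor.batch_decode(tokens, skip_special_tokens=True)[0]
    return english_classifier(english_query)[0]["label"]
```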

6. Experiments and results

In this section, we present a comprehensive analysis of our experimental studies, conducted to assess the effectiveness of the approaches discussed above. We first outline the key hyperparameters and configurations used across our experiments to ensure reproducibility and clarity (subsection 6.1). Subsection 6.2 examines the fine-tuning of BERT-like models, covering both zero-shot cross-lingual transfer and comprehensive fine-tuning. Subsection 6.3 evaluates the retrieval-based intent detection method, benchmarking the models listed earlier. Subsection 6.4 investigates intent detection through LLM prompting, and subsection 6.5 reports the results of the NMT pipeline. Finally, subsection 6.6 provides an error analysis from which we draw the main insights of our experiments. To provide a robust evaluation of model efficacy, we report performance metrics on the test set, including macro F1 scores, precision, and recall, thereby ensuring a holistic assessment of each model’s capabilities.
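For reference, the reported metrics are macro-averaged over the 24 intent classes; a minimal sketch using scikit-learn with toy labels is shown below. It is illustrative only and not the exact evaluation script.

```python
# Sketch of the reported macro-averaged metrics with scikit-learn (toy labels).
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["cancel_order", "loan", "age_limit", "IDOOS"]   # gold intents (toy)
y_pred = ["cancel_order", "fee", "age_limit", "IDOOS"]    # predictions (toy)

macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(f"F1={macro_f1:.2f}  P={macro_precision:.2f}  R={macro_recall:.2f}")
```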

6.1 Experimental setup and hyperparameters

To ensure consistency across our experiments, we adhered primarily to the default settings detailed in the “Transformers Training Arguments” (HuggingFace 2024b), with specific modifications to optimize performance for our task; a minimal configuration sketch is given after this list:

  • The maximum sequence length was set to 128, considering that the sentences in our dataset were relatively short.

  • A batch size of 64 was used for training.

  • Training was conducted using a single T4 GPU.

  • We employed the AdamW optimizer, the standard choice for BERT-like architectures, with an initial learning rate of 5e-5. Weight decay was set to 0 and, following standard practice, applied only to non-bias and non-LayerNorm weights (Loshchilov and Hutter Reference Loshchilov and Hutter2019).

  • The beta coefficients for the AdamW optimizer were kept at the standard values of 0,9 (beta1) and 0,999 (beta2), with an epsilon of 1e-8 to prevent division by zero during optimization. These settings are in line with the defaults recommended in the PyTorch and Hugging Face documentation (Pytorch 2023; HuggingFace 2024a).

  • Gradient clipping was applied with a maximum gradient norm of 1,0 to manage gradient variability, a practice supported by research suggesting that it can accelerate training (Zhang et al. Reference Zhang2020). A linear learning rate scheduler was utilized, with no warmup steps, allowing the model’s performance to dictate the adjustments.

  • Gradient updates were applied after every batch, that is, with a gradient accumulation step of one.

  • Training was conducted over 20 epochs, with the best checkpoint selected based on performance on an evaluation split.
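The following sketch mirrors this configuration with the Hugging Face Trainer API. The checkpoint name illustrates the AraBERTv02-twitter case, and dataset loading is omitted, so this is a template rather than the exact training script.

```python
# Minimal fine-tuning template mirroring the hyperparameters listed above
# (illustrative; dataset loading, tokenization to max_length=128, and metric
# computation are omitted for brevity).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "aubmindlab/bert-base-arabertv02-twitter"  # AraBERTv02-twitter
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=24)

args = TrainingArguments(
    output_dir="darija-intent-finetune",
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    lr_scheduler_type="linear",
    warmup_steps=0,
    gradient_accumulation_steps=1,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=tokenized_train, eval_dataset=tokenized_eval)
# trainer.train()
```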

6.2 BERT-like models fine-tuning results

6.2.1 Zero-shot cross-lingual transfer learning

This section explores the efficacy of zero-shot cross-lingual transfer learning on MSA and Darija, employing the XLM-Roberta model. We first assess this approach using the DarijaBanking English + French training data, obtaining F1 scores of 74,06 for MSA and 47,61 for Darija (Table 5). Further experiments, including training solely on Arabic and testing on Darija, showed that adding English and French data alongside Arabic improves performance: the F1 score for Darija increased from 74,53 (Arabic-only) to 80,76, while for MSA it decreased slightly from 94,27 (Arabic-only) to 93,10, a difference within normal variation. The most comprehensive training set, encompassing English, French, Arabic, and Darija, allowed XLM-R to achieve F1 scores of 95 for MSA and 93,64 for Darija. These findings underscore the challenges faced by multilingual pre-trained models in accurately capturing the nuances of MSA and dialectal Arabic, highlighting the necessity for dedicated data annotations in these languages.

Table 5. Performance of XLM-Roberta Zero-Shot Learning: Gains from Sequential Language Integration in the DarijaBanking Dataset

6.2.2 Comprehensive fine-tuning of pre-trained transformers

In the preceding section, we observed that multilingual pre-trained transformers exhibit suboptimal performance on both MSA and Darija, with Darija presenting particular challenges. This section extends our evaluation to the range of Arabic pre-trained transformer models introduced in section 5.1, including XLM-R, using the DarijaBanking dataset. We present the results in Table 6, ranking the models by their F1 scores on the Darija test set. Notably, Arabertv02-twitter emerges as the top-performing model, achieving impressive F1 scores of 95,55 for MSA and 97,87 for Darija. The model’s efficiency is particularly noteworthy given its relatively small size, which allows deployment on standard CPUs without compromising performance. We refer to this fine-tuned version as BERTouch and introduce it as a key contribution of this work, being the best-performing model for intent detection on the DarijaBanking dataset.
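Since BERTouch is released on Hugging Face (Skiredj et al. 2024a), intent prediction reduces to a standard text-classification pipeline, as sketched below; the Darija utterance is a toy example and the returned label names depend on the uploaded configuration.

```python
# Inference sketch for the released BERTouch checkpoint (model ID from the
# references); the Darija utterance is a toy example.
from transformers import pipeline

intent_classifier = pipeline("text-classification",
                             model="AbderrahmanSkiredj1/BERTouch")

print(intent_classifier("bghit n3ref chno howa l'age li khass bach nfta7 compte"))
# e.g. [{'label': 'age_limit', 'score': ...}]  (label names depend on the config)
```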

Table 6. Performance of various pre-trained transformers on DarijaBanking

6.3 Retrieval-based intent detection results

In this section, we conduct a comprehensive evaluation of the retrieval-based intent detection approach, leveraging the capabilities of pre-trained text embedding models as detailed in Section 5.2, with a focus on their performance using the DarijaBanking dataset. The assessment’s outcomes are systematically presented in Table 7, where models are ordered according to their F1 scores derived from the Darija test set. As expected, the BERT-like models, when adapted for retrieval by using the embedding of their CLS token, showed the poorest performance, confirming their limitations for this task. Among the models evaluated, the “text-embedding-3-large” model by OpenAI stands out in the closed-source category, along with Microsoft’s “multilingual-e5-base” in the open-source domain. The former demonstrates exceptional performance with F1 scores of 90,70 for MSA and 88,44 for Darija. The latter, while slightly trailing with F1 scores of 88,91 for MSA and 86,89 for Darija, offers substantial advantages in terms of efficiency and deployability. Its smaller size enables deployment on standard CPUs without a loss in performance, and its open-source nature facilitates in-house deployment.
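For completeness, the CLS-token adaptation referred to above can be sketched as follows; the checkpoint is illustrative, and in our experiments this baseline was clearly outperformed by purpose-built sentence encoders.

```python
# Sketch of the CLS-token baseline: embed utterances with a vanilla BERT-like
# encoder and compare them by cosine similarity (checkpoint is illustrative).
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "aubmindlab/bert-base-arabertv02-twitter"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

def cls_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state          # (1, seq_len, 768)
    return torch.nn.functional.normalize(hidden[:, 0], dim=-1)  # CLS vector

# Cosine similarity between a test query and an indexed labeled utterance:
score = cls_embedding("bghit nbeddel l'commande dyali") @ cls_embedding("I want to change my order").T
print(float(score))
```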

Table 7. Performance of various pre-trained Retrievers as Intent Detectors on DarijaBanking

6.4 Intent detection by LLM prompting results

In this section, we explore the application of LLMs for intent detection, focusing on their performance on the DarijaBanking dataset. Despite the sophisticated linguistic processing and generative capabilities of these models, their intent detection performance on DarijaBanking falls short of expectations. ChatGPT-3.5 in particular shows a substantial gap in effectiveness, while GPT-4, although more proficient, still achieves only mediocre performance. This outcome is surprising given GPT-4’s advanced language understanding and generation abilities, including its handling of languages and dialects as complex as Moroccan Darija. The Jais-13B model further complicates the landscape: it has notable difficulty aligning with the predefined set of 24 intents, often misclassifying or completely missing the correct intent. This inconsistency underscores the limitations of Jais-13B as an intent classifier despite its potential advantages in other generative applications. The evidence suggests that while Jais-13B may excel at content generation, its utility as a classifier for intent detection on the DarijaBanking dataset is limited. These findings indicate that the general-purpose nature of LLMs is not ideally suited to specific classification tasks such as intent detection, particularly for languages or dialects with limited online presence and resources. They also point to the need for a more nuanced approach: developing and fine-tuning smaller, domain-specific language models offers a more effective solution, as shown in subsection 6.2. Table 8 reports the results obtained.

Table 8. Performance of LLMs as Intent Detectors on DarijaBanking

6.5 Intent detection through NMT pipeline results

In this section, we explore an NMT-based pipeline that leverages well-resourced English models for intent classification. The first step of this pipeline automatically translates queries from the target languages, such as Darija, into English using a proficient translation model. The resulting English queries are then fed into an English intent classification model.

For English intent classification, we fine-tuned a bert-base-uncased model on the English queries extracted from our training set, employing the same hyperparameters as for the other models.

For the evaluation phase, we employed the large multilingual hf-seamless-m4t-large model, recently introduced by Meta (Communication et al. Reference Communication, Barrault, Chung, Meglioli, Dale, Dong, Duquenne, Elsahar, Gong, Heffernan, Hoffman, Klaiber, Li, Licht, Maillard, Rakotoarison, Sadagopan, Wenzek, Ye, Akula, Chen, Hachem, Ellis, Gonzalez, Haaheim, Hansanti, Howes, Huang, Hwang, Inaguma, Jain, Kalbassi, Kallet, Kulikov, Lam, Li, Ma, Mavlyutov, Peloquin, Ramadan, Ramakrishnan, Sun, Tran, Tran, Tufanov, Vogeti, Wood, Yang, Yu, Andrews, Balioglu, Costa-jussà, Celebi, Elbayad, Gao, Guzmán, Kao, Lee, Mourachko, Pino, Popuri, Ropers, Saleem, Schwenk, Tomasello, Wang, Wang and Wang2023), to translate the test queries from both Darija and MSA to English. Subsequently, we computed the corpus-level BLEU score between the original English queries and their translated counterparts, resulting in a score of 0,328.
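The corpus-level BLEU check can be reproduced along the following lines; the use of sacrebleu and the toy sentences are assumptions, and the division by 100 only maps sacrebleu’s 0-100 scale to the 0-1 scale reported above.

```python
# Sketch of the corpus-level BLEU check between the original English queries and
# their machine translations (sacrebleu is an assumption; sentences are toy data).
import sacrebleu

hypotheses = ["how can i activate my new card", "what is the age limit for this service"]
references = [["How do I activate my new card?", "What is the minimum age requirement for this service?"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score / 100, 3))  # sacrebleu uses 0-100; the paper reports 0,328 on a 0-1 scale
```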

The results presented in Table 9 show that the NMT pipeline-based solution outperforms other models, including GPT-4, on both MSA and Darija in terms of F1 score, achieving 83,70 and 89,03, respectively.

Table 9. Results of the NMT pipeline on both Darija and MSA

6.6 Error analysis

This section provides an in-depth analysis of the errors observed across different intent detection approaches, focusing on key questions such as the variability of intent classification difficulty, the potential influence of linguistic divergence between Darija and MSA, and specific issues surrounding the IDOOS intent.

6.6.1 Are some intents easier to predict across all experiments?

Table 10 highlights the relative ease of predicting certain intents across all three approaches: BERT-based fine-tuning (BERTouch), retrieval-based intent detection (the e5 model), and LLM prompting (GPT-4). The three easiest intents, cancel_order, change_order, and get_invoice, consistently scored above 95% F1. This high performance can be attributed to the distinctiveness of these intents in the dataset. Specifically, cancel_order and change_order both deal with the concept of ordering, making them relatively easy to distinguish from other intents. Additionally, get_invoice benefited from having the highest number of utterances (20), resulting in a more stable F1 score, as misclassifying one or two instances had a smaller impact on the overall score.

Table 10. Comparative F1 scores of three models across various customer intents

6.6.2 Are there intents that are particularly difficult to predict across all experiments?

On the opposite end, the intents IDOOS, general_negative_feedback, and compromised_card consistently scored below 80% F1, marking them as the most difficult intents to classify. The IDOOS intent, in particular, performed the worst with an average F1 score of 60%. This is consistent with the literature, which indicates that transformers struggle significantly with out-of-scope intents (Zhang et al. Reference Zhang, Hashimoto, Wan, Liu, Liu, Xiong and Yu2022). Misclassifications of the general_negative_feedback and compromised_card intents can often be traced to their overlap with other intents in the dataset, leading to confusion across models.

6.6.3 Are lower F1 scores for a given intent linked to linguistic divergence between Darija and MSA?

To examine the impact of linguistic divergence, we explored whether there was a correlation between the level of difference between Darija utterances and their MSA counterparts and the difficulty in intent classification. However, as shown in Table 11, no clear pattern emerges. For instance, the BERTouch model achieved an F1 score of 100% on level 3 utterances (where Darija differs the most from MSA), while level 1 and 2 utterances yielded slightly lower but still very high scores of 98% and 97%, respectively.

Table 11. Impact of Linguistic Divergence on F1 Scores Across Intent Detection Methods

In contrast, the e5 retrieval model performed best on level 2 utterances (90%) and struggled more with level 3 (60%). For level 1, the model achieved a score of 70%, indicating a significant drop when moving further away from MSA. Similarly, the GPT-4 prompting model exhibited the highest performance on level 2 utterances (80%), followed by level 1 (74%) and level 3 (60%). This result is consistent with the fact that GPT-4 is primarily trained on high-resource languages, and thus its performance drops when handling linguistically diverse utterances like those in level 3.

When analyzing the easiest and most difficult intents in terms of their mean linguistic divergence levels, we found no apparent link between linguistic variation and classification difficulty. For example, the easiest intents, such as cancel_order, change_order, and get_invoice, had mean divergence levels close to 2, while the most difficult intents, such as loan and general_negative_feedback, exhibited similar divergence levels (ranging between 1,75 and 2,13). Therefore, linguistic divergence does not seem to be a decisive factor in explaining the classification difficulty.

6.6.4 What are the origins of the errors for the most difficult intents?

The errors in classifying the most difficult intents, such as loan, compromised_card, and general_negative_feedback, can largely be explained by two factors: limited examples in the test set and semantic overlap with other intents. For instance, loan had only 12 utterances in the test set, so just two misclassifications were enough to drop its F1 score below 76%. Furthermore, loan was often confused with fee and age_limit, especially in utterances discussing prepayment conditions or eligibility criteria. For example, “What are the costs of early loan repayment?” was mixed up with fee, while “What are the eligibility conditions for applying for a loan?” was confused with age_limit because of the shared term “conditions.” Additionally, there were cases where loan was misclassified as activate_my_card, further complicating the error analysis.

For the e5 retriever model, similar patterns were observed. For example, general_negative_feedback was often mistaken for delete_account, which is understandable since negative feedback and account deletion often co-occur. Additionally, compromised_card was frequently misclassified as fee (e.g., “I did not visit this shop. These fees are unjustified.”), activate_my_card (e.g., “Can you send me a new card and block the current one? I’m in Spain and my wallet was stolen.”), or delete_account (e.g., “There are suspicious transactions in my account.”) because of shared account-related terminology. In total, these errors resulted in an F1 score of 80%.

The GPT-4 prompting model faced similar issues. The compromised_card intent was often confused with get_refund (e.g., “Hello, please address this issue as soon as possible. After I lost my wallet and saw a withdrawal, I don’t want to lose money again.”), cancel_transfer (e.g., “Can I dispute a large payment I found I didn’t make in the last two bills? I know I’m late, please help me.”), and general_negative_feedback (e.g., “I didn’t visit this shop. These fees are unjustified.”), which is understandable given the contextual similarities. Similarly, general_negative_feedback was frequently mixed with delete_account and fee, a natural overlap considering the negative sentiment expressed in these utterances.

6.6.5 Why Is the IDOOS intent so challenging to detect?

The IDOOS intent proves difficult for models to detect, as shown by the low F1 scores of 46% for the retrieval-based and GPT-4 approaches, and 85% for the BERTouch model, compared to average scores of 99% for other intents. This aligns with research showing that transformer-based models struggle to distinguish IDOOS queries from in-scope intents due to surface-level similarities (Zhang et al. Reference Zhang, Hashimoto, Wan, Liu, Liu, Xiong and Yu2022). IDOOS queries often contain overlapping keywords with in-scope intents, leading to misclassification. Confidence scores for IDOOS examples are typically close to in-scope examples, further complicating detection. This highlights the need for more specialized approaches to handle IDOOS detection effectively.
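One common mitigation, sketched below for illustration, is to reject predictions whose softmax confidence falls under a threshold and route them to the out-of-scope class; the threshold value is an arbitrary assumption rather than a parameter tuned in this work.

```python
# Illustrative softmax-confidence thresholding for out-of-scope (IDOOS) rejection;
# the 0.7 threshold is an arbitrary assumption, not a tuned value.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "AbderrahmanSkiredj1/BERTouch"  # fine-tuned classifier from the references
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def classify_with_rejection(utterance: str, threshold: float = 0.7) -> str:
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    confidence, index = probs.max(dim=-1)
    if confidence.item() < threshold:
        return "IDOOS"  # low confidence: treat the query as out-of-scope
    return model.config.id2label[index.item()]
```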

In conclusion, this error analysis underscores that while certain intents are consistently easy to detect due to their distinctiveness or higher data volume, others pose challenges, primarily due to semantic overlap and limited data. Moreover, linguistic divergence between Darija and MSA does not appear to be a decisive factor in predicting classification difficulty, suggesting that other elements, such as intent ambiguity or data scarcity, play a larger role. Finally, the poor performance on the IDOOS intent reaffirms the known difficulty transformers face with out-of-scope detection.

7. Discussion

The exploration of several intent detection approaches for Moroccan Darija in the banking sector, namely comprehensive and zero-shot fine-tuning of monolingual and multilingual models, retrieval-based methods, and prompting with Large Language Models, provides valuable findings with implications for both future research and practical application.

First and foremost, our study demonstrates that relying solely on cross-lingual transfer learning might not be the most effective approach to address the complexities of Moroccan Darija, which contrasts with the findings of previous studies where cross-lingual transfer was often sufficient for intent detection tasks in other languages. The initial results reveal a significant gap in F1 scores between MSA and Darija, underscoring the need for dedicated linguistic resources tailored to dialectal Arabic. This gap narrows significantly with the comprehensive integration of languages and fine-tuning, as shown by the enhanced performance of models like AraBERTv02-twitter when trained on datasets that encompass Darija. This emphasizes the value of investing in high-quality domain-specific data labeling to improve model accuracy and efficiency, especially in bounded domains such as banking, where the complexity and size of the model can be balanced with targeted, high-quality data. Notably, the fine-tuned version of AraBERTv02-twitter, referred to as BERTouch, emerges as the best-performing model for intent detection on DarijaBanking, further highlighting the importance of domain-specific high-quality data for specialized tasks.

Furthermore, the findings urge caution against overreliance on powerful LLMs for specialized classification tasks such as intent detection. While previous studies have demonstrated that LLMs can effectively perform intent detection even in few-shot settings for other languages, models like ChatGPT-3.5 and GPT-4 exhibit limitations in accurately classifying intents within the DarijaBanking dataset. This highlights the importance of incorporating specialized classifiers in the intent classification module of banking chatbots, where precision in understanding customer queries is paramount. LLMs, while beneficial for generating human-like responses and enhancing a chatbot’s conversational capabilities, should complement rather than replace dedicated classifiers fine-tuned for specific intents.

However, the study also acknowledges the economic and logistical constraints of data labeling, recognizing that it can be prohibitively expensive or resource-intensive for some organizations. In such scenarios, retrieval-based approaches using pre-trained text embeddings, such as OpenAI’s “text-embedding-3-large” or the “multilingual-e5-base” model, offer a viable alternative. This approach aligns with previous research advocating for efficient, feature-based methods over full fine-tuning, showing that it is possible to achieve respectable performance while minimizing computational demands. These models provide a practical solution for intent detection, balancing cost and accuracy effectively.

Despite these advances, it is important to acknowledge the limitations of the study. The intent list used is not exhaustive, and real-world applications will likely require further adjustments to accommodate the full spectrum of banking queries. Moreover, the current approach does not account for the contextual and multi-intent nature of real-world conversations, which could provide valuable signals for more accurate intent classification.

8. Conclusion

In conclusion, this paper introduces DarijaBanking, an innovative dataset designed to enhance intent detection within the banking sector for both Moroccan Darija and MSA. Through the adaptation and refinement of English banking datasets, DarijaBanking emerges as a valuable tool for nuanced language processing, comprising over 3600 queries across 24 intent classes. Our analysis, spanning model fine-tuning, zero-shot learning, retrieval-based techniques, and Large Language Model prompting, underscores the importance of tailored approaches for dialect-specific language processing and demonstrates the effectiveness of AraBERTv02-twitter, fine-tuned on this dataset as BERTouch.

The research emphasizes the importance of domain-specific classifiers and the limitations of relying solely on general-purpose Large Language Models for precise intent detection. It also presents retrieval-based approaches as practical, cost-effective alternatives for scenarios where data labeling poses significant economic and logistical challenges. These approaches provide a pragmatic balance between performance and resource allocation, facilitating the advancement of AI-driven solutions in settings that are linguistically diverse and resource-limited.

However, we recognize the limitations of our study, including the non-exhaustive nature of our intent list and the lack of consideration for the contextual and multi-intent dynamics of real-world interactions. These aspects offer avenues for future research to further refine and enhance the application of AI in understanding and servicing the diverse needs of banking customers.

By contributing to the development of resources like DarijaBanking, this paper aims to support the broader goal of making AI technologies more adaptable and effective across various linguistic contexts. In doing so, we hope to inspire continued efforts towards creating more inclusive digital banking solutions and advancing the field of NLP.

References

Abdelali, A., Hassan, S., Mubarak, H., Darwish, K. and Samih, Y. (2021). Pre-training bert on arabic tweets: Practical considerations. arXiv preprint arXiv:2102.10684.
Abdul-Mageed, M., Elmadany, A. and Nagoudi, E.M.B. (2021). ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online. Association for Computational Linguistics, pp. 7088–7105.
Adamopoulou, E. and Moussiades, L. (2020). An overview of chatbot technology. In Maglogiannis I., Iliadis L. and Pimenidis E. (eds), Artificial Intelligence Applications and Innovations. Cham: Springer International Publishing, pp. 373–383.
Algotiml, B., Elmadany, A. and Magdy, W. (2019). Arabic tweet-act: Speech act recognition for arabic asynchronous conversations. In Proceedings of the Fourth Arabic Natural Language Processing Workshop. Florence, Italy: Association for Computational Linguistics, pp. 183–191.
Antoun, W., Baly, F. and Hajj, H. (2020). AraBERT: Transformer-based model for Arabic language understanding. In Al-khalifa, H., Magdy, W., Elsayed, T. and Mubarak, H. (eds), Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection. Marseille, France: European Language Resource Association, pp. 9–15.
Artetxe, M. and Schwenk, H. (2019). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7, 597–610.
Basu, S., Sharaf, A., Ip Kiun Chong, K., Fischer, A., Rohra, V., Amoake, M., El-Hammamy, H., Nosakhare, E., Ramani, V. and Han, B. (2022). Strategies to improve few-shot learning for intent classification and slot-filling. In Proceedings of the Workshop on Structured and Unstructured Knowledge Integration (SUKI), Association for Computational Linguistics, pp. 17–25.
Benchiba-Savenius, N. (2011). A Structural Analysis of Moroccan Arabic and English Intra-Sentential Code Switching. München: Lincom Europa.
Bilah, C.O., Adji, T.B. and Setiawan, N.A. (2022). Intent detection on indonesian text using convolutional neural network. In 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), pp. 174–178.
Boujou, E., Chataoui, H., Mekki, A. E., Benjelloun, S., Chairi, I. and Berrada, I. (2023). Letz translate: Low-resource machine translation for luxembourgish. In 2023 5th International Conference on Natural Language Processing (ICNLP), IEEE, pp. 165–170.
Boujou, E. et al. (2021). An open access nlp dataset for arabic dialects: Data collection, labeling, and model construction. arXiv preprint arXiv:2102.11000.
Casanueva, I., Temcinas, T., Gerz, D., Henderson, M. and Vulic, I. (2020a). Efficient intent detection with dual sentence encoders. In Wen, T.H., Celikyilmaz, A., Yu, Z., Papangelis, A., Eric, M., Kumar, A., Casanueva, I. and Shah, R. (eds), Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. Online. Association for Computational Linguistics, pp. 38–45.
Casanueva, I., Temcinas, T., Gerz, D., Henderson, M. and Vulic, I. (2020b). Efficient intent detection with dual sentence encoders. In Wen, T.H., Celikyilmaz, A., Yu, Z., Papangelis, A., Eric, M., Kumar, A., Casanueva, I. and Shah, R. (eds), Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. Online. Association for Computational Linguistics, pp. 38–45.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., St John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B. and Kurzweil, R. (2018). Universal sentence encoder. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, pp. 169–174.
Chidambaram, M., Yang, Y., Cer, D., Yuan, S., Sung, Y.-H., Strope, B. and Kurzweil, R. (2019). Learning cross-lingual sentence representations via a multi-task dual-encoder model. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Association for Computational Linguistics, pp. 250–259.
Communication, S., Barrault, L., Chung, Y.-A., Meglioli, M. C., Dale, D., Dong, N., Duquenne, P.-A., Elsahar, H., Gong, H., Heffernan, K., Hoffman, J., Klaiber, C., Li, P., Licht, D., Maillard, J., Rakotoarison, A., Sadagopan, K. R., Wenzek, G., Ye, E., Akula, B., Chen, P.-J., Hachem, N. E., Ellis, B., Gonzalez, G. M., Haaheim, J., Hansanti, P., Howes, R., Huang, B., Hwang, M.-J., Inaguma, H., Jain, S., Kalbassi, E., Kallet, A., Kulikov, I., Lam, J., Li, D., Ma, X., Mavlyutov, R., Peloquin, B., Ramadan, M., Ramakrishnan, A., Sun, A., Tran, K., Tran, T., Tufanov, I., Vogeti, V., Wood, C., Yang, Y., Yu, B., Andrews, P., Balioglu, C., Costa-jussà, M., Celebi, O., Elbayad, M., Gao, C., Guzmán, F., Kao, J., Lee, A., Mourachko, A., Pino, J., Popuri, S., Ropers, C., Saleem, S., Schwenk, H., Tomasello, P., Wang, C., Wang, J. and Wang, S. (2023). Seamlessm4t: Massively multilingual & multimodal machine translation.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics, pp. 8440–8451.
Darwish, K., Habash, N., Abbas, M., Al-Khalifa, H., Al-Natsheh, H. T., Bouamor, H., Bouzoubaa, K., Cavalli-Sforza, V., El-Beltagy, S. R., El-Hajj, W., Jarrar, M. and Mubarak, H. (2021). A panoramic survey of natural language processing in the Arab world. Communications of the ACM 64(4), 72–81.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186.
El Mekki, A., El Mahdaouy, A., Berrada, I. and Khoumsi, A. (2022). Adasl: an unsupervised domain adaptation framework for arabic multi-dialectal sequence labeling. Information Processing & Management 59(4), 102964.
El Mekki, A., El Mahdaouy, A., Essefar, K., El Mamoun, N., Berrada, I. and Khoumsi, A. (2021). Bert-based multi-task model for country and province level msa and dialectal arabic identification. In Proceedings of the sixth Arabic natural language processing workshop, pp. 271–275.
Elmadany, A., Mubarak, H. and Magdy, W. (2018). Arsas: An Arabic speech-act and sentiment corpus of tweets. OSACT 3, 20.
Essefar, K., Baha, H., Mahdaouy, A., el Mekki, A. and Berrada, I. (2023). Omcd: Offensive moroccan comments dataset. Language Resources and Evaluation 57(4), 121.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N. and Wang, W. (2020). Language-agnostic bert sentence embedding.
Gerz, D., Su, P.H., Kusztos, R., Mondal, A., Lis, M., Singhal, E., Mrksic, N., Wen, T.H. and Vulic, I. (2021). Multilingual and cross-lingual intent detection from spoken data. In Moens M.F., Huang X., Specia L. and Yih S.W.-t. (eds), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 7468–7475.
Henderson, M., Casanueva, I., Mrkšić, N., Su, P.-H., Wen, T.-H. and Vulić, I. (2020). Convert: Efficient and accurate conversational representations from transformers. In Findings of the Association for Computational Linguistics: EMNLP. Association for Computational Linguistics, pp. 2161–2174.
Hijjawi, M., Bandar, Z. and Crockett, K. (2013). User’s utterance classification using machine learning for arabic conversational agents. In 2013 5th International Conference on Computer Science and Information Technology, pp. 223–232.
Hijjawi, M., Bandar, Z., Crockett, K. and Mclean, D. (2014). Arabchat: An arabic conversational agent. In 2014 6th International Conference on Computer Science and Information Technology (CSIT), pp. 227–237.
HuggingFace (2024a). AdamW implementation. HuggingFace documentation and GitHub repository. Available at: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py (Version 3ec3dd2). Accessed: May 2024.
HuggingFace (2024b). Transformers training args. GitHub repository. Available at: https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py (Version 80126f9). Accessed: May 2024.
Inoue, G., Alhafni, B., Baimukan, N., Bouamor, H. and Habash, N. (2021). The interplay of variant, size, and task type in arabic pre-trained language models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop. Kyiv, Ukraine (Online): Association for Computational Linguistics.
Jarrar, M., Birim, A., Khalilia, M., Erden, M. and Ghanem, S. (2023). ArBanking77: Intent detection neural model and a new dataset in modern and dialectical Arabic. In Sawaf, H. et al. (eds), Proceedings of ArabicNLP 2023. Singapore (Hybrid): Association for Computational Linguistics, pp. 276–287.
Jörg, T. and Santhosh, T. (2020). OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.
Joukhadar, A., Saghergy, H., Kweider, L. and Ghneim, N. (2019). Arabic dialogue act recognition for textual chatbot systems. In Proceedings of The First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) colocated with ICNLSP 2019-Short Papers, pp. 43–49.
Karajah, E., Arman, N. and Jarrar, M. (2021). Current trends and approaches in synonyms extraction: Potential adaptation to arabic. In 2021 International Conference on Information Technology (ICIT), pp. 428–434.
Lakkad, B. (2018). Smart-Banking-Chatbot. GitHub repository. Available at: https://github.com/Brijeshlakkad/smart-banking-chatbot (Version 39407a6). Accessed: April 2024.
Laoudi, J., Bonial, C., Donatelli, L., Tratz, S. and Voss, C. (2018). Towards a computational lexicon for Moroccan Darija: Words, idioms, and constructions. In Savary A., Bonial C., Donatelli L., Tratz S. and Voss C. (eds), Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018). Santa Fe, New Mexico, USA: Association for Computational Linguistics, pp. 74–85.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., tau Yih, W., Rocktäschel, T., Riedel, S. and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA: Curran Associates Inc.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization.
Loukas, L., Stogiannidis, I., Diamantopoulos, O., Malakasiotis, P. and Vassos, S. (2023). Making llms worth every penny: Resource-limited text classification in banking. In Proceedings of the Fourth ACM International Conference on AI in Finance, ICAIF ’23. New York, NY, USA: Association for Computing Machinery, pp. 392–400.
Mezzi, R., Yahyaoui, A., Krir, M. W., Boulila, W. and Koubaa, A. (2022). Mental health intent recognition for arabic-speaking patients using the mini international neuropsychiatric interview (mini) and bert model. Sensors 22(3), 846.
Nagoudi, E.M.B., Elmadany, A. and Abdul-Mageed, M. (2022). TURJUMAN: A public toolkit for neural Arabic machine translation. In Al-Khalifa H., Elsayed T., Mubarak H., Al-Thubaity A., Magdy W. and Darwish K. (eds), Proceedings of the 5th Workshop On Open-Source Arabic Corpora and Processing Tools with Shared Tasks On Qur’an QA and Fine-Grained Hate Speech Detection. Marseille, France: European Language Resources Association, pp. 1–11.
OpenAI (2022). Introducing ChatGPT. OpenAI Website. Available at: https://openai.com/index/chatgpt/. Accessed: March 2023.
OpenAI (2024a). GPT-4 Technical Report. arXiv:2303.08774. Accessed: March 2024.
OpenAI (2024b). New embedding models and API updates. OpenAI Website. Available at: https://openai.com/index/new-embedding-models-and-api-updates/. Accessed: March 2024.
Patel, J. (2017). banking-faq-bot. GitHub repository. Available at: https://github.com/MrJay10/banking-faq-bot (Version 30c5e81). Accessed: May 2024.
PyTorch (2023). AdamW implementation. PyTorch Documentation. Available at: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html. Accessed: May 2023.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Reimers, N. and Gurevych, I. (2020a). Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint, abs/2004.09813.
Reimers, N. and Gurevych, I. (2020b). distiluse-base-multilingual-cased-v1 Model. HuggingFace Website. Available at: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1. Accessed: March 2024.
Reimers, N. and Gurevych, I. (2020c). paraphrase-multilingual-MiniLM-L12-v2 Model. HuggingFace Website. Available at: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. Accessed: March 2024.
Reimers, N. and Gurevych, I. (2020d). paraphrase-multilingual-mpnet-base-v2 Model. HuggingFace Website. Available at: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2. Accessed: March 2024.
Sanad, M., Mo, E.-H., Saad, E., Mohammed, K., Mustafa, J., Sultan, A., Ismail, B. and Houda, B. (2024). Arafinnlp 2024: The first arabic financial nlp shared task. arXiv preprint, abs/2407.09818.
Sengupta, N., Sahu, S. K., Jia, B., Katipomu, S., Li, H., Koto, F., Marshall, W., Gosal, G., Liu, C., Chen, Z., Afzal, O. M., Kamboj, S., Pandit, O., Pal, R., Pradhan, L., Mujahid, Z. M., Baali, M., Han, X., Bsharat, S. M., Aji, A. F., Shen, Z., Liu, Z., Vassilieva, N., Hestness, J., Hock, A., Feldman, A., Lee, J., Jackson, A., Ren, H. X., Nakov, P., Baldwin, T. and Xing, E. (2023). Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint, abs/2308.16149.
Shams, S. and Aslam, M. (2022). Improving user intent detection in urdu web queries with capsule net architectures. Applied Sciences 12(22), 11861.
Shams, S., Aslam, M. and Martinez-Enriquez, A. (2019). Lexical intent recognition in urdu queries using deep neural networks. In Advances in Soft Computing. Cham: Springer International Publishing, pp. 39–50.
Skiredj, A., Azhari, F., Berrada, I. and Ezzini, S. (2024a). BERTouch Model. HuggingFace Website. Available at: https://huggingface.co/AbderrahmanSkiredj1/BERTouch. Accessed: May 2024.
Skiredj, A., Azhari, F., Berrada, I. and Ezzini, S. (2024b). DarijaBanking Dataset. HuggingFace Website. Available at: https://huggingface.co/AbderrahmanSkiredj1/BERTouch. Accessed: May 2024.
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R. and Wei, F. (2022). Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint, abs/2212.03533.
Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Cheng, W., Zhang, Q., Qin, W., Zheng, Y., Qiu, X., Huang, X. and Gui, T. (2023). The rise and potential of large language model based agents: A survey. arXiv preprint, abs/2309.07864.
Xu, Y., Shieh, C.-H., van Esch, P. and Ling, I.-L. (2020). AI customer service: Task complexity, problem-solving ability, and usage intention. Australasian Marketing Journal (AMJ) 28(4), 189–199.
Zhang, J., Bui, T., Yoon, S., Chen, X., Liu, Z., Xia, C., Tran, Q.H., Chang, W. and Yu, P. (2021). Few-shot intent detection via contrastive pre-training and fine-tuning. In Moens M.F., Huang X., Specia L. and Yih S.W.-t. (eds), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 1906–1912.
Zhang, J., Hashimoto, K., Wan, Y., Liu, Z., Liu, Y., Xiong, C. and Yu, P. (2022). Are pre-trained transformers robust in intent classification? A missing ingredient in evaluation of out-of-scope intent detection. In Liu B., Papangelis A., Ultes S., Rastogi A., Chen Y.N., Spithourakis G., Nouri E. and Shi W. (eds), Proceedings of the 4th Workshop on NLP for Conversational AI. Dublin, Ireland: Association for Computational Linguistics, pp. 12–20.
Zhang, J., He, T., Sra, S. and Jadbabaie, A. (2020). Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations (ICLR 2020).
Zhou, Y., Liu, P. and Qiu, X. (2022). Knn-contrastive learning for out-of-domain intent classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 5129–5141.