1. Introduction
Dialog systems are widely used in personal assistants such as Google Assistant, Amazon Alexa, Apple Siri, and Microsoft Cortana. Natural language understanding (NLU) plays an essential role in enabling users to accomplish their tasks through verbal interactions. NLU typically involves two tasks: intent detection and slot filling. Intent detection identifies a speaker's intent from a given utterance and can be treated as a sentence classification problem. Slot filling, in contrast, extracts the correct argument values for the intent's slots from the utterance and can be treated as a sequence labeling task that maps an input word sequence to the corresponding sequence of slot tags. Several joint learning methods have been proposed that model and exploit the relationship between intent detection and slot filling, improving performance over independent models. Figure 1 demonstrates a typical sample from the Airline Travel Information System (ATIS) (Tur, Hakkani-Tür, and Heck Reference Tur, Hakkani-Tür and Heck2010) training set in which the slot filling task is labeled with the IOB representation.
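For instance, such a labeled sample can be represented as a simple record pairing the utterance with one intent label and one IOB slot tag per word (a hypothetical illustration in the spirit of Figure 1, not a verbatim ATIS sample):

```python
# Hypothetical ATIS-style sample: one intent label per utterance and one IOB slot tag
# per word (B- marks the beginning of a slot value, I- its continuation, O no slot).
sample = {
    "utterance": ["show", "flights", "from", "boston", "to", "denver", "on", "monday"],
    "intent": "atis_flight",
    "slots": ["O", "O", "O", "B-fromloc.city_name", "O",
              "B-toloc.city_name", "O", "B-depart_date.day_name"],
}
```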
NLU is eventually required in many languages, most of which do not have large annotated training datasets. In response to the lack of human-labeled data, various methods have been developed to train general-purpose language representation models from large collections of unannotated text, such as Word2Vec (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013) and GloVe (Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014). Pretrained models can be fine-tuned on natural language processing (NLP) tasks and yield significant improvements over models trained only on task-specific annotated data. More recently, contextualized word representations such as ELMo (Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018), the Generative Pre-trained Transformer (GPT) (Radford et al. Reference Radford, Narasimhan, Salimans and Sutskever2018), and Bidirectional Encoder Representations from Transformers (BERT) (Kenton and Toutanova Reference Kenton and Toutanova2019) were proposed and have set the state of the art for a wide variety of NLP tasks. Although these pretrained models significantly improve performance on different NLP tasks, fine-tuning them still requires labeled data, which remains an important issue.
One aspect of generalizability is whether a model can be applied outside the language in which it was trained. A transfer learning approach from a rich-resource language to a low-resource language is therefore desirable. However, developing cross-lingual transfer methods for intent detection and slot filling is challenging due to the lack of multilingual datasets annotated according to the same guidelines. In this work, we use the ATIS dataset, which contains thousands of English utterances (the rich-resource data), and a novel dataset of Persian utterances (the low-resource data) annotated according to a similar annotation scheme. These data make it possible to examine cross-lingual transfer learning from a rich-resource language to a low-resource language. We investigate whether a cross-lingual joint model trained with multiple fine-tuning phases can outperform monolingual models. To summarize, the key contributions of this paper are as follows:
-
To produce our low-resource data, we manually translated and annotated 1462 samples of the ATIS dataset from English to Persian.
-
We explore the performance of current cross-lingual pretrained language models such as multilingual BERT (mBERT) (Kenton and Toutanova Reference Kenton and Toutanova2019) and XLM-RoBERTa (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020) to address the lack of multilingual human-labeled data.
-
We fine-tune the JointBERT+CRF (Chen, Zhuo, and Wang Reference Chen, Zhuo and Wang2019) model under different training scenarios and report the results.
The structure of the paper is as follows: in Section 2, we concisely review and discuss related research. In Section 3, our proposed model and training scenarios are presented. The dataset statistics, evaluation metrics, and experimental setup are explained in Section 4. Detailed experimental results and analysis are given in Section 5, and finally, our conclusions are presented in Section 6.
2. Related works
2.1 Related works on joint intent detection and slot filling models
Pipelining the two subtasks produces accurate results but is prone to error propagation. A joint model, in contrast, can solve slot filling and intent detection simultaneously.
In many cases, the joint distribution is modeled only implicitly via back-propagation of a joint loss (Guo et al. Reference Guo, Tur, Yih and Zweig2014; Hakkani-Tür et al. Reference Hakkani-Tür, Tür, Celikyilmaz, Chen, Gao, Deng and Wang2016; Liu and Lane, Reference Liu and Lane2016). It would be beneficial to capture the relationship between intents and slots explicitly (Han et al. Reference Han, Long, Li, Weld and Poon2021).
Zhang et al. (Reference Zhang, Li, Du, Fan and Philip2019) proposed a joint capsule neural network that uses a dynamic routing-by-agreement scheme between capsule layers, which explicitly models words, slots, and intents at the utterance level.
Qin et al. (Reference Qin, Che, Li, Wen and Liu2019) performed token-level intent detection to improve the robustness of intent detection and proposed a stack-propagation framework that incorporates intent information to guide slot filling.
Zhu et al. (Reference Zhu, Cao and Yu2020) used a dual model for joint intent detection and slot filling to generate sentences from structured semantic forms. They proposed a novel framework for semi-supervised NLU that incorporates the dual model in order to take advantage of unlabeled data.
Yang et al. (Reference Yang, Ji, Ai and Li2021) proposed a joint model of intent detection and slot filling based on a position-aware multi-head masked attention mechanism, in which explicit feature interactions are modeled as the inner product of the word encoding vector and the intent-slot feature vectors. Wang et al. (Reference Wang, Shen and Jin2018) applied a new bidirectional recurrent neural network (Bi-RNN) model to jointly perform the intent detection and slot filling tasks, considering their cross-impact through two correlated bidirectional long short-term memory (Bi-LSTM) networks. Tang et al. (Reference Tang, Ji and Zhou2020) proposed a graph-based conditional random field (CRF) for modeling the implicit connections within slot–slot and slot–intent pairs and solved the incompatibility between slot tags and intent tags by employing a mask mechanism. Chen et al. (Reference Chen, Zhuo and Wang2019) proposed to use the contextual BERT model to learn the two tasks jointly.
Table 1 presents an overview of the results of the existing models and reports F1 score, accuracy, and exact match (EM) on both SNIPS and ATIS datasets.
2.2 Related works on cross-lingual models
Using English in conjunction with low-resource languages to address the lack of annotated data has recently become a popular research topic. Castellucci et al. (Reference Castellucci, Bellomaria, Favalli and Romagnoli2019) considered transfer learning from English to Italian. Upadhyay et al. (Reference Upadhyay, Faruqui, Tür, Dilek and Heck2018) leveraged multilingual word embeddings that share a common vector space across languages to perform zero-shot and almost zero-shot transfer learning for intent detection and slot filling. They translated the English ATIS dataset into Turkish and Hindi. Xu et al. (Reference Xu, Haider and Mansour2020) proposed an end-to-end approach for jointly aligning and predicting target slot labels for cross-lingual transfer. They released Multi-ATIS++, a multilingual NLU corpus with six new languages: Spanish, German, French, Portuguese, Chinese, and Japanese.
Artetxe et al. (Reference Artetxe, Labaka and Agirre2017) developed a method to decrease reliance on large bilingual dictionaries using smaller seed dictionaries. Their approach involved a self-learning framework that can be used in conjunction with any dictionary-based mapping technique; they learned how to map source and target word embeddings through a small word dictionary. He et al. (Reference He, Xu and Yan2020) examined the effectiveness of dividing slot tagging models into a language-shared part and language-specific parts to transfer cross-lingual knowledge and improve monolingual slot tagging. Moreover, they refined the shared knowledge with language discriminators and reinforced information separation through adversarial training.
Gritta and Iacobacci (Reference Gritta and Iacobacci2021) used translated task data to encourage the model to generate similar sentence embeddings for different languages. Gritta et al. (Reference Gritta, Hu and Iacobacci2022) introduced CrossAligner, a cross-lingual transfer method that converts English training data into a task applicable to any language and uses it to synchronize model predictions across languages. They also presented a contrastive alignment method that reduces the cosine distance between translated sentences while increasing it for unrelated sentences, requiring significantly less data than previous work. Additionally, they suggested Translate-Intent as a simple and efficient baseline that surpasses previous Translate-Train methods without error-prone data transformations such as slot label projection.
Schuster et al. (Reference Schuster, Gupta, Shah and Lewis2019) studied transfer to low-resource languages, from English to Spanish and Thai, using three methods of cross-lingual transfer: translation of training data, cross-lingual pretrained embeddings, and a multilingual machine translation encoder as contextual vectors (CoVe) (McCann et al. Reference McCann, Bradbury, Xiong and Socher2017) for word representations. Liu et al. (Reference Liu, Shin, Xu, Winata, Xu, Madotto and Fung2019) proposed attention-informed mixed-language training: instead of manually selecting word pairs, they extract source words based on the scores computed by the attention layer of a trained English task-related model and then generate word pairs using existing bilingual dictionaries. Qin et al. (Reference Qin, Ni, Zhang and Che2021) proposed an augmentation framework that generates multilingual code-switching data to fine-tune mBERT, aligning representations from rich-resource and multiple low-resource languages by mixing their context information.
Using cross-lingual transfer, a model can outperform training only on limited data from the low-resource language. However, several challenges remain. Aligning intents and slots between source and target languages is difficult because of differences in syntax or grammar and translation gaps where certain words have no equivalent translation. Additionally, idiomatic expressions, cultural differences, and regional dialects can vary significantly between languages and regions, further complicating alignment. Moreover, ensuring model generalizability across different languages and language families remains an open area of research.
These challenges motivated us to propose a model for cross-lingual intent detection and slot filling that benefits from the large amounts of data available in rich-resource languages while still using limited data from the target low-resource language to better learn its features.
3. Proposed model
3.1 JointBERT+CRF
As mentioned, we used the JointBERT+CRF model to test different scenarios. The overall structure of the model is presented in Figure 2. In this figure, $FFNN_{ID}$ and $FFNN_{SF}$ denote the feed forward neural networks, each consisting of a single linear layer, that are used in the last layer of the architecture for intent detection and slot filling, respectively. As shown in Figure 2, the model consists of three layers: an encoding layer and two decoding layers for intent detection and slot filling.
Encoding layer: A pretrained multilayer bidirectional transformer encoder named BERT is employed in the encoding layer to produce contextualized latent feature embeddings. Two special tokens are inserted: a classification embedding ([CLS]) as the first token and a separator token ([SEP]) as the last token. For a sequence of input tokens $x = (x^{1}, \ldots, x^{T})$, the output of BERT is $H = (h_{1}, \ldots, h_{T})$. In the BERT model, two strategies are employed to pretrain the model on large-scale unlabeled texts: the masked language model (MLM) and next sentence prediction (NSP). The cross-lingual BERT models, specifically mBERT (Kenton and Toutanova Reference Kenton and Toutanova2019) and XLM-RoBERTa (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020), provide powerful context-dependent sentence representations that can be fine-tuned for cross-lingual tasks, including intent detection and slot filling.
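As a minimal sketch (assuming the Hugging Face transformers library, not the authors' released code), the contextual states $H$ can be obtained as follows:

```python
# Minimal sketch: contextual token embeddings H = (h_1, ..., h_T) from a
# multilingual pretrained encoder.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-multilingual-cased"   # or "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

inputs = tokenizer("show me flights from boston to denver", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

H = outputs.last_hidden_state   # shape (1, T, 768); H[:, 0] is the [CLS] state h_1
```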
Intent detection layer: The BERT model can easily be extended to a joint intent detection and slot filling model. We feed the hidden state of the special token [CLS], denoted as $h_{1}$, into a feed forward layer followed by a softmax layer. The intent is predicted as follows:
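In a standard JointBERT-style formulation, with $W^{I}$ and $b^{I}$ denoting the assumed weight matrix and bias of $FFNN_{ID}$,

$y^{I} = \mathrm{softmax}\left(W^{I} h_{1} + b^{I}\right)$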
Slot filling layer: For slot filling, we feed the final hidden states of the remaining tokens $h_{2}, \ldots, h_{T}$ into a feed forward layer followed by a softmax layer that classifies over the slot labels. Since BERT tokenizes each input token into multiple sub-tokens using WordPiece tokenization, we only use the hidden state corresponding to the first sub-token as the input to the slot decoder:
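In a standard formulation, with $W^{S}$ and $b^{S}$ denoting the assumed weight matrix and bias of $FFNN_{SF}$,

$y_{n}^{S} = \mathrm{softmax}\left(W^{S} h_{n} + b^{S}\right)$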
where $h_n$ is the hidden state corresponding to the first sub-token of word $x_n$ .
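A minimal sketch of this first-sub-token selection, assuming a Hugging Face fast tokenizer and the hypothetical helper name first_subtoken_states, could look as follows:

```python
# Minimal sketch (assumption, not the authors' code): keep only the hidden state of
# the first sub-token of each word as input to the slot decoder.
import torch

def first_subtoken_states(words, tokenizer, encoder):
    # Tokenize pre-split words so each sub-token can be traced back to its word.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]   # (T, hidden_size), includes [CLS]/[SEP]
    word_ids = enc.word_ids()                      # sub-token -> word index (None for special tokens)
    kept, seen = [], set()
    for position, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:    # first sub-token of each word
            kept.append(hidden[position])
            seen.add(wid)
    return torch.stack(kept)                       # (num_words, hidden_size)
```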
CRF layer: The predictions for slot labels are influenced by the surrounding words. As shown by Chen et al. (Reference Chen, Zhuo and Wang2019), structured prediction models such as CRFs can improve slot filling performance. We therefore add a CRF on top of the joint BERT model to capture slot label dependencies. Given an input sentence $x$ of length $L$ and the tag scores $y$, the final score of a sequence of tags $z$ is calculated as follows:
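A standard linear-chain CRF score consistent with the definitions that follow is

$\mathrm{score}(x, z) = \sum_{t=1}^{L} \left( A_{z_{t-1}, z_{t}} + y_{t, z_{t}} \right)$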
In the transition matrix $A$, $A_{p,q}$ represents the binary (transition) score of moving from tag $p$ to tag $q$, and $y_{t,z_t}$ represents the unary score of assigning tag $z_t$ to the $t^{th}$ word. During the training phase, we aim to maximize the following objective function given the ground truth sequence of tags $z$:
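A standard form of this objective is the CRF log-likelihood,

$\log p(z \mid x) = \mathrm{score}(x, z) - \log \sum_{\tilde{z} \in Z} \exp\left(\mathrm{score}(x, \tilde{z})\right)$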
Here, $Z$ represents the set of all possible tagging paths.
Joint training: The joint model has a training objective loss ($L$), which is the weighted sum of the intent detection loss $L_{ID}$ and the slot filling loss $L_{SF}$:
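In its standard weighted-sum form, this loss can be written as

$L = \lambda L_{ID} + (1 - \lambda) L_{SF}$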
The hyperparameter $\lambda$ represents the combination weight: $0 < \lambda < 1$.
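A schematic PyTorch sketch of this architecture, assuming the transformers and pytorch-crf packages (an illustrative re-implementation under stated assumptions, not the authors' code), is given below:

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF   # from the pytorch-crf package

class JointBertCrf(nn.Module):
    def __init__(self, model_name, num_intents, num_slots, dropout=0.1, lam=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.intent_ffnn = nn.Linear(hidden, num_intents)   # FFNN_ID (single linear layer)
        self.slot_ffnn = nn.Linear(hidden, num_slots)        # FFNN_SF (single linear layer)
        self.crf = CRF(num_slots, batch_first=True)
        self.lam = lam                                        # combination weight lambda

    def forward(self, input_ids, attention_mask, intent_labels=None, slot_labels=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        intent_logits = self.intent_ffnn(self.dropout(h[:, 0]))   # h_1, the [CLS] state
        slot_logits = self.slot_ffnn(self.dropout(h))             # per-token tag scores y
        mask = attention_mask.bool()
        if intent_labels is None or slot_labels is None:
            # Inference: intent logits plus the CRF's best tag path
            # (special-token positions are included in this simplified sketch).
            return intent_logits, self.crf.decode(slot_logits, mask=mask)
        # Training: L = lambda * L_ID + (1 - lambda) * L_SF.
        # slot_labels at padded/special positions are assumed to hold a valid tag index.
        loss_id = nn.functional.cross_entropy(intent_logits, intent_labels)
        loss_sf = -self.crf(slot_logits, slot_labels, mask=mask, reduction="mean")
        return self.lam * loss_id + (1 - self.lam) * loss_sf
```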
3.2 Training scenarios
Considering the cross-lingual approach of our proposed model, we perform various training scenarios to train our models on English (EN) and Persian (PR) data. In all experiments, we assume that a large number of samples are available in the English training data, while the Persian training data includes only a small number of samples.
The training scenarios are as follows; a schematic sketch of the corresponding fine-tuning phases is given after the list:
-
PR: In this model, we only use Persian training data, that is, the joint intent-slot model is fine-tuned in one step using the Persian data.
-
EN: This model only uses English training data as our rich-resource data.
-
PR $\to$ EN: The model is first trained on the Persian training data and then on the English training data.
-
EN $\to$ PR: The model is first trained on the English training data and then on the Persian training data. Figure 3 provides a comprehensive view of this scenario.
-
EN + PR: A combination of English and Persian training data is used to train the model. Figure 4 represents this scenario.
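The sketch below illustrates how these scenarios translate into one or two fine-tuning phases; the callable fine_tune is an assumed stand-in for a standard supervised fine-tuning loop, not part of the authors' released code:

```python
def run_scenario(model, scenario, english_data, persian_data, fine_tune):
    """fine_tune(model, data) is assumed to run one supervised fine-tuning phase
    (up to 20 epochs per phase in the setup described in Section 4.3)."""
    phases = {
        "PR": [persian_data],
        "EN": [english_data],
        "PR->EN": [persian_data, english_data],   # second phase continues from PR-tuned weights
        "EN->PR": [english_data, persian_data],   # second phase continues from EN-tuned weights
        "EN+PR": [english_data + persian_data],   # single phase on the mixed data
    }
    for data in phases[scenario]:
        fine_tune(model, data)
    return model
```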
4. Evaluation
4.1 Dataset
To evaluate different scenarios, we used two benchmark datasets:
-
ATIS: ATIS is a popular and widely used dataset in NLU research, containing English audio recordings of people making flight reservations. The dataset comprises 4478 utterances for training and 893 utterances for testing, with 21 intent and 120 slot tags. In the training stage, we used the English ATIS training utterances as our rich-resource dataset. To obtain the low-resource dataset, we translated 500 random utterances (approximately 10$\%$ of the original data) of ATIS from English to Persian; for the test stage, we translated the entire ATIS test set and additionally added 69 informal translated utterances.Footnote a Some examples of translated ATIS utterances with corresponding labels are shown in Figure 5.
-
MASSIVE: MASSIVE is a newly released joint NLU dataset (FitzGerald et al. Reference FitzGerald, Hench, Peris, Mackie, Rottmann, Sanchez, Nash, Urbach, Kakarala and Singh2022) composed of one million realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots (108 IOB slot tags). MASSIVE contains 12664, 2974, and 2974 samples in its training, development, and test sets, respectively. Similar to ATIS, we used the English MASSIVE training set and 10$\%$ of the Persian MASSIVE data for training, along with the entire Persian MASSIVE test set for testing.
Statistics of both datasets are presented in Table 2.
4.2 Evaluation metrics
To evaluate the different scenarios, we employed three standard evaluation metrics: F1 score for slot filling, accuracy for intent detection, and EM for intent detection and slot filling jointly. EM counts the test samples whose predictions are entirely correct. The other metrics are computed by the equations below:
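Standard definitions consistent with the variable descriptions below are

$\mathrm{precision} = \frac{S}{M}, \qquad \mathrm{recall} = \frac{S}{N}, \qquad F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{accuracy} = \frac{T}{K}$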
where $N$ is the number of gold slot chunks in the test set, $M$ is the number of predicted slot chunks, $S$ is the number of correctly predicted slot chunks, $T$ is the number of correctly predicted intents, and $K$ is the number of utterances.
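A minimal sketch of how these metrics can be computed, assuming the seqeval package for span-level slot F1 and a hypothetical evaluate helper, is:

```python
# Minimal sketch (assumption): span-level slot F1 via seqeval, plus simple counting
# for intent accuracy and exact match (EM).
from seqeval.metrics import f1_score

def evaluate(gold_slots, pred_slots, gold_intents, pred_intents):
    slot_f1 = f1_score(gold_slots, pred_slots)   # lists of IOB tag sequences per utterance
    intent_acc = sum(g == p for g, p in zip(gold_intents, pred_intents)) / len(gold_intents)
    em = sum(gs == ps and gi == pi
             for gs, ps, gi, pi in zip(gold_slots, pred_slots, gold_intents, pred_intents)
             ) / len(gold_intents)
    return slot_f1, intent_acc, em
```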
4.3 Experimental setup
We conduct experiments on our datasets to study the usefulness of pretrained language model-based encoders. Here, we employ XLM-RoBERTa$_{BASE}$ (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020) and mBERT (Kenton and Toutanova Reference Kenton and Toutanova2019) (two recent state-of-the-art pretrained language models that support Persian) as the encoders.
-
mBERT: mBERT is a multilingual BERT model with 12 layers, 768 hidden units each, and 12 attention heads, pretrained on the top 104 languages (including Persian) with texts from Wikipedia using the MLM and NSP objectives. The entire model has 110M parameters.
-
XLM-RoBERTa: XLM-RoBERTa is a multilingual variant of RoBERTa, pretrained on 2.5TB of multilingual data covering 100 languages (including Persian). It does not use the NSP task and is trained only with the multilingual MLM objective.
The maximum length of an utterance is 50. The batch size is set to 128. Adam is used for optimization with an initial learning rate of 5e-5, and the dropout probability is 0.1. When training on one language or on the mixture of the two languages, the maximum number of epochs is 20. When training is done in two phases over two languages (one phase per language), the maximum number of epochs is 20 per phase. The reported results are the average of five runs with five different random seeds.
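For reference, these hyperparameters can be summarized in a simple configuration sketch (field names are illustrative, not taken from the authors' code):

```python
# Hyperparameters reported above, collected as an illustrative configuration dictionary.
config = {
    "max_seq_length": 50,
    "batch_size": 128,
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "dropout": 0.1,
    "max_epochs_per_phase": 20,   # 20 epochs per language phase in two-phase scenarios
    "num_random_seeds": 5,        # results averaged over five runs
}
```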
5. Results
Table 3 presents the results of all training scenarios using mBERT as the encoder. The table shows that the cross-lingual scenarios are more effective, achieving better results than the monolingual scenarios.
In the scenario where we only use English data, all evaluation metrics on the MASSIVE dataset, as well as the intent detection accuracy on the ATIS dataset, outperform the scenario in which only the Persian data are used. The reason could be the larger size of the English training data. Among all scenarios, PR $\to$ EN exhibits the best results on the MASSIVE dataset. On the ATIS dataset, the best results are obtained in two scenarios, PR $\to$ EN and EN + PR, which indicates that combining the rich-resource English dataset with the low-resource Persian dataset is effective.
Table 4 gives the results of all training scenarios using XLM-RoBERTa as the encoder. As shown in this table, on the ATIS dataset, training only on the Persian data achieves better performance on all evaluation metrics than training only on the English data. However, on the MASSIVE dataset, similar to the mBERT results, the large size of the English MASSIVE training set leads to better performance.
According to our experiments, the mBERT pretrained language model yields the best results for both datasets. Its better performance compared to XLM-RoBERTa may be due to the curse of multilinguality (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020). On the ATIS dataset, the highest F1 was obtained in the EN $\to$ PR scenario (75.94), the highest accuracy in the PR $\to$ EN scenario (90.64), and the highest EM in the EN + PR scenario (50.1). On the MASSIVE dataset, the highest values of all three metrics were attained in the PR $\to$ EN scenario (79.88 F1, 87.79 accuracy, and 69.87 EM).
5.1 Equal multilingual resources
To compare the performance obtained with 10% of the Persian dataset against comparable amounts of data in the two languages, we used all 12664 Persian samples of the MASSIVE dataset. The results for the mBERT and XLM-RoBERTa language models are presented in Table 5.
Not surprisingly, on the MASSIVE dataset the PR scenario with the mBERT language model improved by 25.68% on slot F1, 35.24% on intent accuracy, and 34.12% on EM, and when mBERT is replaced by XLM-RoBERTa we observe improvements of 26.16%, 55.65%, and 52.79% on slot F1, intent accuracy, and EM, respectively. However, when using mBERT as the pretrained language model, the cross-lingual scenarios gained only 0.72% in slot F1 and 1.6% in EM, with no improvement in intent accuracy. With XLM-RoBERTa, the improvements over the best results obtained with 10% of the Persian dataset are 2.78% on slot F1, 1.41% on intent accuracy, and 2.06% on EM. In Figure 6, we investigate how performance varies with different percentages of Persian data in the PR $\to$ EN scenario. As can be seen from the figure, increasing the size of the Persian dataset does not significantly improve the EM metric.
5.2 Error analysis
Figures 7 and 8 show two typical test examples, one from the ATIS dataset annotated by the EN $+$ PR training model and one from the MASSIVE dataset annotated by the PR $\to$ EN training model. Both samples obtain correct slot filling and intent detection results using the mBERT language model.
Figures 9 and 10 show two examples, one from each dataset, for which our model fails. In the example of Figure 9, our model predicts the wrong slot tag 'depart_day.day_time' for the word 'tenth' instead of 'depart_day.dayname'. In the example of Figure 10, our model annotates the words 'US dollar' and 'Iranian rial' with the slot 'currency_name', whereas they actually represent the slot 'news_topic'. Additionally, our model incorrectly assigns the intent tag 'qa_currency' instead of 'news_query'. Nevertheless, even humans may find it challenging to recognize the correct slots and intent here.
Table 6 presents the confusion matrix for the intent detection task on our ATIS dataset. There are a few errors due to the imbalanced data problem, since most utterances are labeled as 'flight'. Intent labels such as 'airline;flight$\_$no' and 'flight;airline' have significantly less representation in the dataset and are misclassified. In the slot filling task, some slot labels are predicted as null tags, primarily due to the scarcity of slot tags compared to null tags.
6. Conclusion
In this paper, we presented the first public Persian intent detection and slot filling dataset for task-oriented dialog systems, which consists of 500 samples for training and 962 samples for testing. The dataset was translated from English to Persian based on the ATIS dataset, and its slot and intent labels were annotated by humans. We evaluated the performance of different scenarios using rich-resource and low-resource data on the ATIS dataset. To increase the reliability of the results of our different scenarios, we also repeated the experiments on the MASSIVE dataset. For both datasets, we consistently found that cross-lingual learning scenarios improve results compared to training only on limited amounts of data in a monolingual manner.
Future work will focus on adapting the model to deal with multi-intent scenarios. In addition, we will explore the possibility of training the model on the same scenarios to enable it to handle three or more languages simultaneously. Using data augmentation to overcome the problems of low-resource languages is another line of our future work.