
Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task

Published online by Cambridge University Press:  09 June 2022

Maria Tikhonova*
Affiliation:
HSE University, Moscow, Russia
Vladislav Mikhailov
Affiliation:
HSE University, Moscow, Russia
Dina Pisarevskaya
Affiliation:
Independent Researcher, London, UK
Valentin Malykh
Affiliation:
Huawei Noah’s Ark Lab, Moscow, Russia
Tatiana Shavrina*
Affiliation:
HSE University, Moscow, Russia; AI Research Institute (AIRI), Moscow, Russia
*Corresponding author: E-mail: [email protected]; [email protected]

Abstract

Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

The latest advances in neural architectures of language models (LMs) (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) have raised the importance of NLU benchmarks as a standardized practice of tracking progress in the field and exceeded conservative human baselines on some datasets (Raffel et al., Reference Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li and Liu2020; He et al., Reference He, Liu, Gao and Chen2021). Such LMs are centered around the “pre-train & fine-tune” paradigm, where a pretrained LM is directly fine-tuned for solving a downstream task. Despite the impressive empirical results, pretrained LMs struggle to learn linguistic phenomena from raw text corpora (Rogers Reference Rogers2021), even when increasing the size of pretraining data (Zhang et al., Reference Zhang, Warstadt, Li and Bowman2021). Furthermore, the fine-tuning procedure can be unstable (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019) and raise doubts about whether it promotes task-specific linguistic reasoning (Kovaleva et al., Reference Kovaleva, Romanov, Rogers and Rumshisky2019). The brittleness of standard fine-tuning approaches to various sources of randomness (e.g., weight initialization and training data order) can lead to different evaluation results and prediction confidences of models, independently fine-tuned under the same experimental setup. Recent research has defined this problem as (in)stability (Dodge et al., Reference Dodge, Ilharco, Schwartz, Farhadi, Hajishirzi and Smith2020); (Mosbach et al., Reference Mosbach, Andriushchenko and Klakow2020a), which now serves as a subject of an interpretation direction, aimed at exploring the consistency of linguistic generalization of LMs (McCoy et al., Reference McCoy, Frank and Linzen2018, Reference McCoy, Min and Linzen2020).

Our paper is devoted to this problem in the task of natural language inference (NLI), which has been widely used to assess language understanding capabilities of LMs in monolingual and multilingual benchmarks (Wang et al., Reference Wang, Singh, Michael, Hill, Levy and Bowman2018, Reference Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy and Bowman2019; Liang et al., Reference Liang, Duan, Gong, Wu, Guo, Qi, Gong, Shou, Jiang, Cao, Fan, Zhang, Agrawal, Cui, Wei, Bharti, Qiao, Chen, Wu, Liu, Yang, Campos, Majumder and Zhou2020; Hu et al., Reference Hu, Ruder, Siddhant, Neubig, Firat and Johnson2020b). The task is framed as a binary classification problem, where the model should predict whether the meaning of the hypothesis is entailed by the premise. Many works show that NLI models learn shallow heuristics and spurious correlations in the training data (Naik et al., Reference Naik, Ravichander, Sadeh, Rose and Neubig2018; Glockner et al., Reference Glockner, Shwartz and Goldberg2018; Sanchez et al., Reference Sanchez, Mitchell and Riedel2018), stimulating a targeted evaluation of LMs on out-of-distribution sets covering inference phenomena of interest (Yanaka et al., Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019b; Yanaka et al., Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019a; McCoy et al., Reference McCoy, Pavlick and Linzen2019; Tanchip et al., Reference Tanchip, Yu, Xu and Xu2020). Although such datasets are extremely useful for analyzing how well LMs capture inference and abstract properties of language, English remains the focal point of the research, leaving other languages underexplored.

To this end, our work extends the ongoing research on the fine-tuning stability and consistency of linguistic generalization to the multilingual setting, covering five Indo-European languages from four language groups: English (West Germanic), Russian (Balto-Slavic), French (Romance), German (West Germanic), and Swedish (North Germanic). Our contributions are twofold. First, we propose GLUE-style textual entailment and diagnostic datasets for French, Swedish, and German. Second, we explore the stability of linguistic generalization of mBERT across the five languages mentioned above, analyzing the impact of the random seed choice, training dataset size, and presence of linguistic categories in the training data. Our work differs from similar approaches described in Section 2 in that we (i) evaluate the inference abilities through the lens of broad-coverage diagnostics, which is often neglected for upcoming LMs, typically compared among one another only by the averaged scores on canonical benchmarks (Dehghani et al., Reference Dehghani, Tay, Gritsenko, Zhao, Houlsby, Diaz, Metzler and Vinyals2021); and (ii) analyze the per-category stability of the model fine-tuning for the considered languages, testing mBERT’s cross-lingual transfer abilities.

2. Related work

NLI and diagnostic datasets. There is a wide variety of datasets constructed to facilitate the development of novel approaches to the problem of NLI (Storks et al., Reference Storks, Gao and Chai2019). The task has evolved within a series of RTE challenges (Dagan et al., Reference Dagan, Glickman and Magnini2005) and now comprises several standardized benchmark datasets such as SICK (Marelli et al., Reference Marelli, Menini, Baroni, Bentivogli, Bernardi and Zamparelli2014), SNLI (Bowman et al., Reference Bowman, Angeli, Potts and Manning2015), MNLI (Williams et al., Reference Williams, Nangia and Bowman2018), and XNLI (Conneau et al., Reference Conneau, Rinott, Lample, Williams, Bowman, Schwenk and Stoyanov2018b). Despite the rapid progress, recent work has found that these benchmarks may contain biases and annotation artifacts which raise questions whether state-of-the-art models indeed have or acquire the inference abilities (Tsuchiya Reference Tsuchiya2018; Belinkov et al., Reference Belinkov, Poliak, Shieber, Van Durme and Rush2019). Various linguistic datasets have been proposed to challenge the models and help to improve their performance on inference features (Glockner et al., Reference Glockner, Shwartz and Goldberg2018; Yanaka et al., Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019a, 2019b, 2020; McCoy et al., Reference McCoy, Pavlick and Linzen2019; Richardson et al., Reference Richardson, Hu, Moss and Sabharwal2020; Hossain et al., Reference Hossain, Kovatchev, Dutta, Kao, Wei and Blanco2020; Tanchip et al., Reference Tanchip, Yu, Xu and Xu2020). The MED (Yanaka et al., Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019a) and HELP (Yanaka et al., Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019b) datasets focus on aspects of monotonicity reasoning, motivating the follow-up work on systematicity of this phenomenon (Yanaka et al., Reference Yanaka, Mineshima, Bekki and Inui2020). 
HANS (McCoy et al., Reference McCoy, Pavlick and Linzen2019) aims at evaluating the generalization abilities of NLI models beyond memorizing lexical and syntactic heuristics in the training data. Similar in spirit, the concept of semantic fragments has been applied to synthesize datasets that target quantifiers, conditionals, monotonicity reasoning, and other features (Richardson et al., Reference Richardson, Hu, Moss and Sabharwal2020). The SIS dataset (Tanchip et al., Reference Tanchip, Yu, Xu and Xu2020) covers symmetry of verb predicates, and it is designed to improve systematicity in neural models. Another feature studied in the field is negation which has proved to be challenging not only for the NLI task (Hossain et al., Reference Hossain, Kovatchev, Dutta, Kao, Wei and Blanco2020; Hosseini et al., Reference Hosseini, Reddy, Bahdanau, Hjelm, Sordoni and Courville2021) but also for probing factual knowledge in masked LMs (Kassner and Schütze Reference Kassner and Schütze2020).

Last but not least, broad-coverage diagnostics is introduced in the GLUE benchmark (Wang et al., Reference Wang, Singh, Michael, Hill, Levy and Bowman2018) and has now become a standard dataset for examining linguistic knowledge of LMs on GLUE-style leaderboards. To the best of our knowledge, there are only two counterparts of the diagnostic dataset for Chinese and Russian, introduced in the CLUE (Xu et al., Reference Xu, Hu, Zhang, Li, Cao, Li, Xu, Sun, Yu, Yu, Tian, Dong, Liu, Shi, Cui, Li, Zeng, Wang, Xie, Li, Patterson, Tian, Zhang, Zhou, Liu, Zhao, Zhao, Yue, Zhang, Yang, Richardson and Lan2020) and Russian SuperGLUE benchmarks (Shavrina et al., Reference Shavrina, Fenogenova, Anton, Shevelev, Artemova, Malykh, Mikhailov, Tikhonova, Chertok and Evlampiev2020). Creating such datasets is not addressed in recently proposed GLUE-like benchmarks for Polish (Rybak et al., Reference Rybak, Mroczkowski, Tracz and Gawlik2020) and French (Le et al., Reference Le, Vial, Frej, Segonne, Coavoux, Lecouteux, Allauzen, Crabbé, Besacier and Schwab2020).

Stability of neural models. A growing body of recent studies has explored the role of optimization, data, and implementation choices on the stability of training and fine-tuning neural models (Henderson et al., Reference Henderson, Islam, Bachman, Pineau, Precup and Meger2018; Madhyastha and Jain Reference Madhyastha and Jain2019; Dodge et al., Reference Dodge, Ilharco, Schwartz, Farhadi, Hajishirzi and Smith2020; Mosbach et al., Reference Mosbach, Andriushchenko and Klakow2020a). Bhojanapalli et al., (Reference Bhojanapalli, Wilber, Veit, Rawat, Kim, Menon and Kumar2021) and Zhuang et al., (Reference Zhuang, Zhang, Song and Hooker2021) investigate the impact of weight initialization, mini-batch ordering, data augmentation, and hardware on the prediction disagreement between image classification models. In NLP, BERT has demonstrated instability when being fine-tuned on small datasets across multiple restarts (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019). This has motivated further research on the most contributing factors to such behavior, mostly the dataset size and the choice of random seed as a hyperparameter (Bengio Reference Bengio2012), which influences training data order and weight initialization. The studies report that changing only random seed during the fine-tuning stage can cause a significant standard deviation of the validation performance, including tasks from the GLUE benchmark (Lee et al., Reference Lee, Cho and Kang2019; Dodge et al., Reference Dodge, Ilharco, Schwartz, Farhadi, Hajishirzi and Smith2020; Mosbach et al., Reference Mosbach, Andriushchenko and Klakow2020a; Hua et al., Reference Hua, Li, Dou, Xu and Luo2021). Another direction involves studying the effect of random seeds on model performance and robustness in terms of attention interpretation and gradient-based feature importance methods (Madhyastha and Jain Reference Madhyastha and Jain2019).

Linguistic competence of BERT. A plethora of works is devoted to the linguistic analysis of BERT, and the inspection of how fine-tuning affects the model knowledge (Rogers et al., Reference Rogers, Kovaleva and Rumshisky2020). The research has covered various linguistic phenomena, including syntactic properties (Warstadt and Bowman Reference Warstadt and Bowman2019), structural information (Jawahar et al., Reference Jawahar, Sagot and Seddah2019), semantic knowledge (Goldberg Reference Goldberg2019), common sense (Cui et al., Reference Cui, Cheng, Wu and Zhang2020), and many others (Ettinger Reference Ettinger2020). Contrary to the common understanding that BERT can capture the language properties, some studies reveal that the model tends to lose the information after fine-tuning (Miaschi et al., Reference Miaschi, Brunato, Dell’Orletta and Venturi2020); (Singh et al., Reference Singh, Wallat and Anand2020); (Mosbach et al., Reference Mosbach, Khokhlova, Hedderich and Klakow2020b) and fails to acquire task-specific linguistic reasoning (Kovaleva et al., Reference Kovaleva, Romanov, Rogers and Rumshisky2019); (Zhao and Bethard Reference Zhao and Bethard2020); (Merchant et al., Reference Merchant, Rahimtoroghi, Pavlick and Tenney2020). Several works explore the consistency of linguistic generalization of neural models by independently training them from 50 to 5,000 times and evaluating their generalization performance (Weber et al., Reference Weber, Shekhar and Balasubramanian2018; Liška et al., Reference Liška, Kruszewski and Baroni2018; McCoy et al., Reference McCoy, Frank and Linzen2018; McCoy et al., Reference McCoy, Min and Linzen2020). In the spirit of these studies, we analyze the stability of the mBERT model w.r.t. diagnostic inference features, extending the experimental setup to the multilingual setting.

3. Multilingual datasets

This section describes textual entailment and diagnostic datasets for five Indo-European languages: English (West Germanic), Russian (Balto-Slavic), French (Romance), German (West Germanic), and Swedish (North Germanic). We use existing datasets for English (Wang et al., Reference Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy and Bowman2019) and Russian (Shavrina et al., Reference Shavrina, Fenogenova, Anton, Shevelev, Artemova, Malykh, Mikhailov, Tikhonova, Chertok and Evlampiev2020) and propose their counterparts for the other languages based on the GLUE-style methodology (Wang et al., Reference Wang, Singh, Michael, Hill, Levy and Bowman2018).

3.1 Recognizing textual entailment

The task of recognizing textual entailment is framed as a binary classification problem, where the model should predict whether the meaning of the hypothesis is entailed by the premise. We provide an example from the English RTE dataset below and give brief statistics for each language in Table 1.

  • Premise: ‘Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.’

  • Hypothesis: ‘Christopher Reeve had an accident.’

  • Entailment: False.

Table 1. Statistics of the NLI datasets. Vocab size refers to the total number of unique words. Num. of words stands for the average number of words in a sample. Fr = French; De = German; Sw = Swedish.

English: RTE (Wang et al., Reference Wang, Singh, Michael, Hill, Levy and Bowman2018) is a collection of datasets from a series of competitions on recognizing textual entailment, constructed from news and Wikipedia (Dagan et al., Reference Dagan, Glickman and Magnini2005; Haim et al., Reference Haim, Dagan, Dolan, Ferro, Giampiccolo, Magnini and Szpektor2006; Giampiccolo et al., Reference Giampiccolo, Magnini, Dagan and Dolan2007; Bentivogli et al., Reference Bentivogli, Clark, Dagan and Giampiccolo2009).

Russian: Textual Entailment Recognition for Russian (TERRa) (Shavrina et al., Reference Shavrina, Fenogenova, Anton, Shevelev, Artemova, Malykh, Mikhailov, Tikhonova, Chertok and Evlampiev2020) is an analog of the RTE dataset that consists of sentence pairs sampled from news and fiction segments of the Taiga corpus (Shavrina and Shapovalova Reference Shavrina and Shapovalova2017).

French, German, Swedish: Each sample from TERRa is manually translated and verified by professional translators with the linguistic peculiarities preserved, culture-specific elements localized, and ambiguous samples filtered out. The resulting datasets contain fewer unique words than the ones constructed by filtering text sources (RTE and TERRa). We relate this to the fact that translated texts may exhibit less lexical diversity and vocabulary richness (Al-Shabab Reference Al-Shabab1996; Nisioi et al., Reference Nisioi, Rabinovich, Dinu and Wintner2016).

3.2 Broad-coverage diagnostics

Broad-coverage diagnostics (Wang et al., Reference Wang, Singh, Michael, Hill, Levy and Bowman2018) is an expert-constructed evaluation dataset that consists of 1104 NLI sentence pairs annotated with linguistic phenomena under four high-level categories (see Table 2). The dataset is originally included in the GLUE benchmark. It is used as an additional test set for examining the linguistic competence of LMs, which allows for revealing possible biases and conducting a systematic analysis of the model behavior.

Table 2. The linguistic annotation of the diagnostic dataset.

As part of this study, LiDiRus (Linguistic Diagnostics for Russian), an equivalent diagnostic dataset for the Russian language, is created (Shavrina et al., Reference Shavrina, Fenogenova, Anton, Shevelev, Artemova, Malykh, Mikhailov, Tikhonova, Chertok and Evlampiev2020). The creation procedure includes a manual translation of the English diagnostic samples by expert linguists so that each indicated linguistic phenomenon and target label is preserved and culture-specific elements are localized. We apply the same procedure to construct diagnostic datasets for French, German, and Swedish by translating and localizing the English diagnostic samples. The label distribution in each dataset is 42/58% (Entailment: True/False). Consider an example of the NLI pair (Sentence 1: ‘John married Gary’; Sentence 2: ‘Gary married John’; Entailment: True) and its translation in each language:

  • English: ‘John married Gary’ entails ‘Gary married John’;

  • Russian: ‘’ entails ‘’;

  • French: ‘John a épousé Gary’ entails ‘Gary a épousé John’;

  • German: ‘John heiratete Gary’ entails ‘Gary heiratete John’;

  • Swedish: ‘John gifte sig med Gary’ entails ‘Gary gifte sig med John’.

Linguistic challenges. Special attention is paid to the problems of the feature-wise translation of the examples. Since all the considered languages are Indo-European, relatively few translation challenges arise. For instance, all languages have morphological negation mechanisms, lexical semantics features, common sense, and world knowledge instances. The main distinctions are related to the category of the Predicate-Argument Structure. The strategy of case coding is exhibited differently across the languages, for example, in dative constructions. Dative was widely used in all ancient Indo-European languages and is still present in modern Russian, retaining numerous functions. In contrast, dative constructions are largely underrepresented in English and Swedish, and all the dative examples in the translations involve impersonal constructions with an indirect object instead of a subject. The same goes for genitives and partitives, where standard noun phrase syntax indicates genitive relations, as Swedish and English do not have case marking. For French, the “de + noun” constructions are used to indicate partitiveness or genitiveness. Below is an example of an English sentence and its corresponding translations into Swedish and French:

  • English: ‘A formation of approximately 50 officers of the police of the City of Baltimore eventually placed themselves between the rioters and the militiamen, allowing the 6th Massachusetts to proceed to Camden Station.’;

  • Swedish: ‘Om 50 poliser i staden Baltimore, i slutändan stod mellan demonstranterna och brottsbekämpande myndigheter, vilket gjorde det möjligt för 6: e Massachusetts Volunteer Regiment går till Cadman station.’;

  • French: ‘Une cinquantaine de policiers de Baltimore se sont finalement interposés entre les manifestants et les forces de l’ordre, permettant au 6e régiment de volontaires du Massachusetts de se rendre à Cadman Station.’.

Translations for the Logic and Knowledge categories are obtained with no difficulty; for example, all existential constructions share patterns with the translated analogs of quantifiers such as “some,” “many,” etc. However, we acknowledge that some low-level categories cannot be translated straightforwardly. For example, elliptic structures are, in general, quite different in Russian than in the other languages. Despite this, the translation-based method avoids the need for additional language-specific expert annotation.

4. Experimental setup

The experiments are conducted on the mBERT model, pretrained on concatenated monolingual Wikipedia corpora in 104 languages. We use the SuperGLUE framework under the jiant environment (Pruksachatkun et al., Reference Pruksachatkun, Yeres, Liu, Phang, Htut, Wang, Tenney and Bowman2020b) to fine-tune the model multiple times for each language with a fixed set of hyperparameters while changing only the random seeds.

Fine-tuning. We follow the SuperGLUE fine-tuning and evaluation strategy with a set of default hyperparameters as follows. We fine-tune the mBERT model using a random seed $\in [0; 5]$, batch size of 4, learning rate of $1 \times 10^{-5}$, global gradient clipping, dropout probability of $p=0.1$, and the AdamW optimizer (Loshchilov and Hutter Reference Loshchilov and Hutter2017). The fine-tuning is performed on 4 Christofari Tesla V100 GPUs (32GB) for a maximum of 10 epochs with early stopping on the NLI validation data. The model is evaluated on the corresponding broad-coverage diagnostics dataset as described below.

Evaluation. Since the feature distribution and class ratio in the diagnostic set are not balanced, the model performance is evaluated with the Matthews correlation coefficient (MCC), the two-class variant of the $R_3$ metric (Gorodkin Reference Gorodkin2004):

\begin{equation*}MCC = \frac{TP\times TN - FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.\end{equation*}

MCC is computed between the array of model predictions and the array of gold labels (Entailment: True/False) for each low-level linguistic feature according to the annotation (Wang et al., Reference Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy and Bowman2019). The range of values is $[-1; 1]$ (higher is better).
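The formula above can be illustrated with a minimal Python sketch. The function names `confusion_counts` and `mcc` are hypothetical and not taken from the jiant codebase. Returning 0 when the denominator vanishes is an assumed convention, since the text does not say how degenerate cases are handled.

```python
from math import sqrt


def confusion_counts(gold, pred):
    """Count TP, TN, FP, FN for boolean labels (True = entailment)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: the score is 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example (illustrative labels, not data from the paper).
gold = [True, True, False, False]
pred = [True, False, False, False]
score = mcc(*confusion_counts(gold, pred))
```

In the per-feature setting described above, the same function would be applied to the subset of diagnostic pairs annotated with a given low-level feature.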

Fine-tuning stability. Fine-tuning stability has multiple definitions in recent research. The majority of studies estimate the stability as the standard deviation of the validation performance, measured by accuracy, MCC, or F1-score (Phang et al., Reference Phang, Févry and Bowman2018; Lee et al., Reference Lee, Cho and Kang2019; Dodge et al., Reference Dodge, Ilharco, Schwartz, Farhadi, Hajishirzi and Smith2020). Another possible notion is per-point stability, where a set of models is analyzed w.r.t. their predictions on the same evaluation sample (Mosbach et al., Reference Mosbach, Andriushchenko and Klakow2020a; McCoy et al., Reference McCoy, Pavlick and Linzen2019). More recent works evaluate the stability by more granular measures, such as predictive divergence, L2 norm of the trained weights, and standard deviation of subgroup validation performance (Zhuang et al., Reference Zhuang, Zhang, Song and Hooker2021). This work analyzes the stability in terms of pairwise Pearson’s correlation as follows. Given a fixed experimental setup, we compute the correlation coefficients between the MCC scores on the diagnostic datasets, achieved by the models trained with different random seeds, and average the coefficients by the total number of models (higher is better). Besides, we assess the per-category stability, that is, the standard deviation in the model performance w.r.t. random seeds for samples within a particular diagnostic category.
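The pairwise-correlation notion of stability can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The helper names `pearson` and `seed_stability` are hypothetical, each model is represented by its vector of feature-wise MCC scores, and the coefficients are averaged over all unordered pairs of random seeds (the normalization given in the caption of Table 3).

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant score vector")
    return cov / sqrt(var_x * var_y)


def seed_stability(mcc_by_seed):
    """Average pairwise Pearson correlation between models' feature-wise MCC scores.

    mcc_by_seed maps a random seed to a list of MCC scores, one per diagnostic
    feature, with features in the same order for every seed.
    """
    pairs = combinations(sorted(mcc_by_seed), 2)
    return mean([pearson(mcc_by_seed[a], mcc_by_seed[b]) for a, b in pairs])


# Toy scores for three seeds and three features (illustrative values only).
toy_scores = {
    0: [0.1, 0.2, 0.3],
    1: [0.2, 0.4, 0.6],
    2: [0.3, 0.2, 0.1],
}
stability = seed_stability(toy_scores)
```

A value close to 1 means that models trained with different seeds rank the diagnostic features similarly, while values near 0 or below indicate inconsistent behavior across seeds.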

5. Testing the linguistic knowledge and fine-tuning stability

5.1 Language-wise diagnostics

We start with investigating how well the linguistic properties are learned given the standardized NLI dataset by fine-tuning the mBERT model on the corresponding train data for each language independently with the same hyperparameters and computing overall MCC by averaging MCC scores for each diagnostic feature. Figure 1 shows a language-wise heat map with the results we use as a “baseline” performance to analyze different experiment settings. Despite the fact that the overall MCC scores are insignificantly different from one another (e.g., German: $0.15$ , English: $0.2$ ), there is variability in how the model outputs correlate with the linguistic features w.r.t. the languages. In order to measure this variability, we compute pairwise Pearson’s correlation between the overall MCC scores and average the coefficients over the total number of language pairs. The resulting Pearson’s correlation is $0.3$ , which denotes that the knowledge obtained during fine-tuning predominantly varies across the languages, and there is no general pattern in the model behavior. For instance, Conditionals contribute to the correct predictions for English (MCC = $0.6$ ), slightly lower for French (MCC = $0.27$ ), are neutral for German (MCC = $0.09$ ) and do not help to solve the task for Russian (MCC = $-0.31$ ) and Swedish (MCC = $-0.25$ ). On the other hand, some features receive similar MCC scores for specific languages, such as Active/Passive (English: MCC = $0.38$ ; French: MCC = $0.38$ ; Russian: MCC = $0.26$ ; Swedish: MCC = $0.24$ ), Anaphora/Coreference (French: MCC = $0.21$ ; German: MCC = $0.21$ ; Russian: MCC = $0.26$ ), Common sense (French: MCC = 0; German: MCC = 0; Swedish: MCC = 0), Datives (German: MCC = $0.34$ ; Russian: MCC = $0.38$ ; Swedish: MCC = $0.34$ ), Genitives/Partitives (English: MCC = 0; French: MCC = $0.036$ ; German: MCC = 0), and Symmetry/Collectivity (English: MCC = $-0.12$ ; French: MCC = $-0.17$ ; German: MCC = $-0.17$ ).

Figure 1. Heat map of the mBERT’s language-wise evaluation on the diagnostic datasets. The brighter the color, the higher the MCC score.

Figure 2. MCC scores on the English diagnostic dataset for mBERT fine-tuned with multiple random seeds.

5.2 Fine-tuning stability and random seeds

We fine-tune the mBERT model multiple times while changing only the random seeds $\in [0;5]$ for each considered language as described in Section 4. Figure 2 shows the seed-wise results for English. The results for the other languages are presented in Appendix 8.1. The overall pattern is that the correlation between the fine-grained diagnostic features and the model outputs varies w.r.t. the random seed. Namely, some features demonstrate a large variance in the MCC score over different random seeds, for example, Conditionals (English: MCC = $0.6$ [0]; MCC = $0.13$ [1, 4, 5]), Nominalization (English: MCC = $0.46$ [0]; MCC = $0.46$ [1, 3, 4, 5]), Datives (French: MCC = $0.64$ [4]; MCC = $0.76$ [5]; MCC = 0 [1, 3]), Non-monotone (French: MCC = 0 [0, 2]; MCC = $-0.58$ [4]; MCC = $0.21$ [5]), Genitives/Partitives (German: MCC = 0 [0, 1]; MCC = $0.56$ [2]; MCC = $-0.29$ [4]), Restrictivity (Russian: MCC = $0.12$ [0, 2, 5]; MCC = 0 [3, 4]; MCC = $-0.65$ [1]), and Redundancy (Swedish: MCC = $0.34$ [2]; MCC = 0 [3]; MCC = $0.8$ [5]). On the one hand, a number of features correlate positively with the model predictions regardless of the random seed, such as Core args, Intersectivity, Prepositional phrases, Datives (English); Active/Passive, Existential, Upward monotone (French); Anaphora/Coreference and Universal (German); Factivity and Redundancy (Russian); Symmetry/Collectivity and Upward monotone (Swedish). On the other hand, some features predominantly receive negative MCC scores: Disjunction and Intervals/Numbers (English), Symmetry/Collectivity (French and Russian), Coordination scope and Double negation (German), Conditionals and Temporal (Swedish). Table 3 aggregates the results of the seed-wise diagnostic evaluation for each language. While the overall MCC scores differ insignificantly within each language, the mBERT model still has a weak correlation with the linguistic properties.
Besides, the pairwise Pearson’s correlation coefficients between the RS models vary between languages by up to $0.22$, which indicates that the fine-tuning stability of the mBERT model depends on the language.

Table 6 (see Appendix 8.1) presents granular results of the per-category fine-tuning stability of the mBERT model for each language. We now describe the categories that have received the smallest and largest standard deviations in the MCC scores over multiple random seeds. For most of the languages, the most stable categories are Common sense ($\sigma \in [0.04;\; 0.09]$) and Factivity ($\sigma \in [0.04;\; 0.1]$), while the most unstable ones are the categories of Lexical Semantics, Logic, and Predicate-Argument Structure, for example, Genitives/Partitives ($\sigma \in [0.17;\; 0.31]$), Datives ($\sigma \in [0.12;\; 0.34]$), Restrictivity ($\sigma \in [0.04;\; 0.3]$), and Redundancy ($\sigma \in [0.16;\; 0.32]$). The variance in the performance indicates the inconsistency of the linguistic generalization on a certain group of categories both collectively and discretely for the languages.
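Per-category stability, understood as the standard deviation of a category's MCC score over random seeds, could be computed along these lines. The function name `per_category_std` and the nested-dictionary layout are hypothetical. The population standard deviation is used here as an assumption, since the text does not say whether the population or the sample estimator was applied.

```python
from statistics import pstdev


def per_category_std(mcc_by_seed):
    """Standard deviation of each category's MCC score across random seeds.

    mcc_by_seed maps a random seed to a {category: MCC} dictionary; every seed
    is expected to cover the same categories.
    """
    runs = list(mcc_by_seed.values())
    categories = runs[0].keys()
    # Assumption: population standard deviation (pstdev), not the sample one.
    return {cat: pstdev([run[cat] for run in runs]) for cat in categories}


# Toy scores for two seeds and two categories (illustrative values only).
toy_runs = {
    0: {"Category A": 0.0, "Category B": 0.5},
    1: {"Category A": 0.2, "Category B": 0.5},
}
sigmas = per_category_std(toy_runs)
most_stable = min(sigmas, key=sigmas.get)
```

Sorting the resulting dictionary by value gives the most and least stable categories for one language.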

5.3 Fine-tuning stability and dataset size

Recent and contemporaneous studies report that a small number of training samples leads to unstable fine-tuning of the BERT model (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019; Phang et al., Reference Phang, Févry and Bowman2018; Zhu et al., Reference Zhu, Cheng, Gan, Sun, Goldstein and Liu2019; Pruksachatkun et al., Reference Pruksachatkun, Phang, Liu, Htut, Zhang, Pang, Vania, Kann and Bowman2020a; Dodge et al., Reference Dodge, Ilharco, Schwartz, Farhadi, Hajishirzi and Smith2020). Toward that end, we conduct two experiments to investigate how additional training data impacts the fine-tuning stability in the cross-lingual transfer setting and how it changes while the number of training samples gradually increases. We use the MNLI (Williams et al., Reference Williams, Nangia and Bowman2018) dataset for English and collapse “neutral” and “contradiction” samples into the “not entailment” label to meet the format of the RTE task (Wang et al., Reference Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy and Bowman2019). The resulting number of the additional training samples is 374k which are added to each language’s corresponding RTE training data.
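The label collapsing described above amounts to a simple mapping from the three MNLI classes to the two RTE classes. The sketch below is illustrative: the function name `collapse_mnli_label` is hypothetical, and the exact label strings (in particular `"not_entailment"`) are assumptions about the data format rather than details given in the text.

```python
def collapse_mnli_label(label):
    """Map a three-way MNLI label onto the binary RTE label set."""
    if label == "entailment":
        return "entailment"
    if label in ("neutral", "contradiction"):
        # Assumed spelling of the binary negative class.
        return "not_entailment"
    raise ValueError(f"unexpected MNLI label: {label!r}")


# Toy usage on a few labels (illustrative, not real MNLI rows).
mnli_labels = ["entailment", "neutral", "contradiction", "entailment"]
rte_style_labels = [collapse_mnli_label(label) for label in mnli_labels]
```

Because two of the three source classes map to the negative label, the collapsed data may be skewed toward the negative class if the source classes are roughly balanced.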

Table 3. Results of the fine-tuning stability experiments w.r.t. random seeds for each language. Overall MCC = overall MCC scores of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.

Does extra data in English improve stability for all languages? To analyze the performance patterns, we compute deltas between the feature-wise MCC scores and standard deviation values ($\sigma$) when using a single RTE training dataset (see Section 5.2) and a combination of the RTE and MNLI training datasets. Figure 3 shows heat maps of how the fine-tuning stability has changed after fine-tuning on the additional data. We find that the MCC scores have increased for 32% of categories among all languages on average (the delta between the MCC scores is more than $0.1$). The per-category fine-tuning stability has improved for 34% of categories among all languages on average (the delta between the $\sigma$ values is below $-0.05$; see footnote e). An interesting observation is that some categories receive confident performance improvements for all languages (the MCC delta is above $0.2$). Such categories include Conjunction, Coordination scope, Genitives/Partitives, Non-monotone, Prepositional phrases, Redundancy, and Relative clauses. However, the additional data does not help for learning the Disjunction and Downward monotone categories and even hurts the performance as opposed to the results in Section 5.2. We also find that 61% of categories for Russian have the $\sigma$ deltas below $-0.05$, indicating that the per-category stability can be greatly improved by extending the training data with examples in the English language.
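The delta analysis can be sketched as below: each heat-map cell is the difference between the value obtained with RTE + MNLI and the value obtained with RTE alone, and the reported percentages are the fraction of cells beyond a threshold. The helper names and all numbers in the toy dictionaries are invented for illustration and are not results from the paper.

```python
def delta_table(before, after):
    """Per-cell change (after - before), keyed by (language, category)."""
    return {key: after[key] - before[key] for key in before}


def share_above(deltas, threshold):
    """Fraction of cells whose delta is strictly above the threshold."""
    return sum(d > threshold for d in deltas.values()) / len(deltas)


def share_below(deltas, threshold):
    """Fraction of cells whose delta is strictly below the threshold."""
    return sum(d < threshold for d in deltas.values()) / len(deltas)


# Invented toy values, NOT taken from the paper.
mcc_rte = {("en", "Conjunction"): 0.10, ("en", "Disjunction"): 0.05,
           ("ru", "Conjunction"): 0.20, ("ru", "Disjunction"): 0.00}
mcc_rte_mnli = {("en", "Conjunction"): 0.40, ("en", "Disjunction"): -0.05,
                ("ru", "Conjunction"): 0.25, ("ru", "Disjunction"): -0.10}
sigma_rte = {("en", "Conjunction"): 0.30, ("en", "Disjunction"): 0.20,
             ("ru", "Conjunction"): 0.25, ("ru", "Disjunction"): 0.10}
sigma_rte_mnli = {("en", "Conjunction"): 0.10, ("en", "Disjunction"): 0.22,
                  ("ru", "Conjunction"): 0.15, ("ru", "Disjunction"): 0.09}

mcc_improved = share_above(delta_table(mcc_rte, mcc_rte_mnli), 0.1)
stability_improved = share_below(delta_table(sigma_rte, sigma_rte_mnli), -0.05)
```

The thresholds ($0.1$ for MCC, $-0.05$ for $\sigma$) are the empirically chosen ones quoted in the text and can be tightened or relaxed.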

Table 4 presents the results of this setting with a comparison to the previous experiments where the model is fine-tuned on the standardized train data size with multiple random seeds (see Sections 5.1 and 5.2). The overall trend is that extension of the RTE training data with the MNLI samples helps to improve the fine-tuning stability for each language. Overall MCC scores for the diagnostic features have increased from $0.177$ to $0.263$ on average (up by 49%), and the average standard deviation decreased by $0.166$. Analyzing the impact on the fine-tuning stability w.r.t. random seed (see Appendix 8.2), we observe that variance in the MCC scores between the RS models has predominantly decreased for all languages. Moreover, pairwise Pearson’s correlation coefficients between the RS models have improved from $0.509$ to $0.837$ on average (up by 64%).
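For reference, the RS corr. metric and the relative improvements quoted above can be reproduced with a short sketch. It assumes that each RS model is represented by its vector of per-category MCC scores and that the Pearson coefficient is averaged over all unordered seed pairs; the function names and the toy score vectors are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rs_corr(scores_by_seed):
    """Average pairwise Pearson correlation over all pairs of RS models."""
    pairs = list(combinations(scores_by_seed, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def relative_change(old, new):
    """Relative change of a metric, e.g. 0.49 for a 49% increase."""
    return (new - old) / old


# Toy per-category MCC vectors for three RS models (invented values).
toy = [[0.1, 0.2, 0.3], [0.2, 0.4, 0.6], [0.3, 0.2, 0.1]]
toy_corr = rs_corr(toy)

# Averages quoted in the text.
mcc_gain = relative_change(0.177, 0.263)   # about 0.49
corr_gain = relative_change(0.509, 0.837)  # about 0.64
```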

How many training samples are required for stability? To investigate the fine-tuning stability in the context of the training data size, we fine-tune the mBERT model as described in Section 4, while changing random seed $\in [0;\; 5]$ and gradually adding the MNLI samples $\in [1k, 5k, 10k, 50k, 100k, 200k, 250k, 374k]$ to the RTE training data for English and Russian. Figure 4 shows the results of this experiment. Despite the fact that the overall MCC scores stop increasing at the size of $RTE + 10k$ for both languages, the RS corr. is steadily improving, indicating a smaller variance in the MCC scores between the RS models. Besides, the model needs more data to improve the stability for Russian (recall that we add extra data in English).
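The experimental grid described above can be enumerated with a few lines. This is a sketch only: the configuration keys are hypothetical, the language codes are shorthand, and the sample counts are the nominal values from the text (for example, 374k is treated as 374,000).

```python
from itertools import product

SEEDS = range(6)  # random seeds 0..5 inclusive
MNLI_SIZES = [1_000, 5_000, 10_000, 50_000, 100_000, 200_000, 250_000, 374_000]
LANGUAGES = ["en", "ru"]


def experiment_grid():
    """One configuration per (language, MNLI subset size, random seed)."""
    return [
        {"language": lang, "mnli_samples": size, "seed": seed}
        for lang, size, seed in product(LANGUAGES, MNLI_SIZES, SEEDS)
    ]


runs = experiment_grid()
```

Under these assumptions the grid contains 2 × 8 × 6 = 96 fine-tuning runs.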

Figure 3. Feature-wise heat maps of the performance patterns after fine-tuning on combined RTE and MNLI training datasets. Left: Delta between MCC scores (higher is better). Right: Delta between standard deviation values (lower is better).

5.4 Fine-tuning stability and presence of linguistic categories

We conduct the following experiment to investigate the relationship between the fine-tuning stability and particular diagnostic categories in the training data. We design a rule-based pipeline for annotating 15 out of 33 diagnostic features for English and Russian. Then, we evaluate the model depending on the percentage of their presence in the corresponding RTE training dataset combined with 10k training samples from MNLI (this amount of extra data is selected based on the results in Section 5.3).

Table 4. Results of the fine-tuning stability w.r.t using additional MNLI training samples in the cross-lingual transfer setting. Overall MCC = overall MCC scores of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.

Figure 4. Results of the fine-tuning stability w.r.t. the number of additional MNLI training samples added to the RTE training data for English and Russian. Overall MCC = overall MCC scores of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.

Figure 5. Distribution of the model MCC scores when fine-tuned on the combined data (RTE + 10k) as opposed to the standardized dataset size.

Description of annotation pipeline. Our study suggests that annotation of low-level diagnostic categories can be partially automatized based on features expressed lexically or grammatically. Lexical Semantics can be detected by the presence of quantifiers, negation morphemes, factivity verbs, and proper nouns. Logic features can be expressed with the indicators of temporal relations (mostly prepositions, conjunctions, particles, and deictic words), negation, and conditionals. Features from the Predicate-Argument Structure category can be identified with pronouns and syntactic tags (e.g., Relative clauses, Datives, etc.). However, Knowledge categories cannot be obtained in this manner.

Such an approach relies only on the surface representation of the feature and is limited by the coverage of the predefined rules, thus leaving room for false-negative results. Keeping this in mind, we construct a set of linguistic heuristics to identify the presence of a particular feature based on the morphosyntactic and NER annotation with spaCy (see footnote f) for English, and built-in dictionaries and morphological analysis with pymorphy2 for Russian (Korobov 2015). We also construct specific word lists for most of the features for both languages, for example, “all,” “some,” “every,” “any,” “anyone,” “everyone,” “nothing,” etc. (Quantifiers). The heuristics for the Russian language have several differences. For instance, dative constructions are detected by the morphological analysis of the nouns or pronouns, as the case is explicitly expressed in the inflection.
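A word-list heuristic of the kind described above might look like the following sketch for the Quantifiers feature. It is a deliberate simplification: a regular-expression tokenizer stands in for the spaCy annotation used by the authors, the word list contains only the examples quoted in the text, and the function names are hypothetical.

```python
import re

# Only the example entries quoted in the text; the authors' full list is longer.
QUANTIFIERS = {"all", "some", "every", "any", "anyone", "everyone", "nothing"}


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def has_quantifier(premise, hypothesis):
    """True if either sentence of the pair contains a listed quantifier."""
    tokens = set(tokenize(premise)) | set(tokenize(hypothesis))
    return bool(tokens & QUANTIFIERS)


flagged = has_quantifier("Everyone left the room.", "The room is empty.")
# "Most" is not in the list, so this pair is missed: a false negative
# of exactly the kind the surface-level rules are prone to.
missed = has_quantifier("Most dogs bark.", "Dogs bark.")
```

Matching whole tokens rather than substrings avoids spurious hits such as “handsome” for “some”.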

Stability and category distribution. We use the pipeline to annotate each training sample from RTE, TERRa, and the MNLI 10k subset. Table 7 presents the feature distributions for the datasets (see Appendix 8.3). Figure 5 depicts the model performance trajectories when fine-tuned on the combined data as opposed to the standardized dataset size (see Section 5.1). The behavior is predominantly similar for both languages, and there is a strong correlation of $0.94$ between the MCC performance improvements. We select four features for further analysis (see footnote g): Conjunction (the MCC score improved for both languages), Anaphora/Coreference (there is a significant difference in the feature distribution between RTE and MNLI, and no such difference between TERRa and MNLI), Negation (the MCC score decreased for both languages, and the feature distribution differs between the languages), and Disjunction (the MCC score decreased for both languages). For each considered feature, we construct three controllable subsets with a varying percentage of the presence in the training data. We follow the same fine-tuning and evaluation strategy (see Section 4), changing random seed $\in [0;\; 5]$ and the feature percentage presence $\in [25, 50, 75]$. Table 5 presents the results of the experiment. The general pattern observed for both languages is that adding more feature-specific training samples may rather hurt the fine-tuning stability along with the MCC score for the feature.
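One way to build such controllable subsets is sketched below. The exact sampling procedure is not specified in the text, so this is an assumption: samples annotated as containing the feature and samples without it are drawn separately so that the feature occupies the requested share of the subset. The function name and the toy pools are hypothetical.

```python
import random


def build_subset(with_feature, without_feature, size, feature_pct, seed=0):
    """Sample `size` items so that `feature_pct` percent contain the feature."""
    rng = random.Random(seed)
    n_with = round(size * feature_pct / 100)
    n_without = size - n_with
    if n_with > len(with_feature) or n_without > len(without_feature):
        raise ValueError("not enough annotated samples for the requested mix")
    subset = rng.sample(with_feature, n_with) + rng.sample(without_feature, n_without)
    rng.shuffle(subset)
    return subset


# Toy pools (invented): items are (has_feature, sample_id) pairs.
pool_with = [(True, i) for i in range(100)]
pool_without = [(False, i) for i in range(100)]
subsets = {
    pct: build_subset(pool_with, pool_without, size=40, feature_pct=pct)
    for pct in (25, 50, 75)
}
```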

Table 5. Results of the fine-tuning stability w.r.t. varying degree of the feature distribution in the MNLI subset for English and Russian. Feature MCC = feature MCC score of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.

Feature MCC. The highest MCC scores for English are achieved when adding 50% (Conjunction, Negation) or 75% (Anaphora/Coreference, Disjunction) extra samples. In contrast, this amount of data has decreased the MCC performance for Russian (Conjunction, Negation). Instead, the minimum amount of 25% additional samples is required to obtain the best MCC scores for the categories of Conjunction and Disjunction. Negation obtains an insignificant improvement when adding 75% samples, and Anaphora/Coreference reaches an MCC of $0.223$ at 50% extra data.

Fine-tuning stability. Although the feature MCC scores may increase, the fine-tuning stability may decrease for the identical amounts of additional training samples, for example, Conjunction (English and Russian), Negation (Russian), Anaphora/Coreference (English), and Disjunction (English and Russian). The smallest variance between the RS models is predominantly observed at the 25% or 50% extra data size for both languages.

Probing analysis. To analyze from another perspective, we apply the annotation pipeline to construct three probing tasks, aimed at identifying the presence of categories of Logic, Lexical Semantics, and Predicate-Argument structure. More details can be found in Appendix 8.4.

6. Discussion

Acquiring linguistic knowledge through NLI. A thorough language-wise analysis using the proposed multilingual datasets reveals how well the model learns the phenomena it is intended to learn for solving the NLI task. Despite the variability in the MCC performance, mBERT shows a similar behavior on a number of features on the languages that differ in their richness of morphology and syntax (see Section 5.1). Specifically, the model outputs are positively correlated with the following diagnostic categories that reflect the language peculiarities: Logic (Upward monotone, Conditionals, Existential, Universal, and Conjunction), Lexical semantics (Named entities), and Predicate-Argument structure (Ellipsis, Coordination scope, and Anaphora/Coreference). On the contrary, there are a number of features that predominantly receive negative MCC scores: Logic (Disjunction, Downward monotone, and Intervals/Numbers) and Predicate-Argument structure (Restrictivity). The Logic features are reminiscent of the properties of formal semantics, which captures the meaning of linguistic expressions through their logical interpretation utilizing formal models (Venhuizen et al., 2021). Monotonicity (Upward/Downward monotone), as one of such features, covers various systematic patterns and allows for assessing inferential systematicity in natural languages. In line with Yanaka et al. (2019b), our results show that the model generally struggles to learn the Downward monotone inferences with Disjunction for all languages. Another phenomenon to which mBERT is insensitive is the category of Negation.
The model outputs weakly correlate with the true labels when the sample contains Negation, Double negation, and Morphological negation, indicating that the model fails to infer this core construction, which is a well-studied problem in the field (Naik et al., 2018; Ettinger 2020; Hosseini et al., 2021). Recently, Wallace et al. (2019) have shown that it is difficult for contextualized LMs to generalize beyond the numerical values seen during training, and various datasets and model improvements have been proposed to analyze and enhance the understanding of numeracy (Thawani et al., 2021). The results for the category Intervals/Numbers in the context of the NLI problem reveal that numerical reasoning does not correlate with the expected model behavior (German and Russian) and even confuses the model (English, French, and Swedish). We also find that the results for the category of Symmetry/Collectivity (Lexical Semantics) vary between the considered languages, achieving negative MCC scores for most of them (English, French, and German). We relate this to the fact that the model may overly rely on the knowledge about entities and relations between them, refined from the pretraining corpora, so that linguistic expressions of the features are ignored (Tanchip et al., 2020; Kassner and Schütze 2020). Last but not least, we find that broadly defined categories such as Common sense and World knowledge do not show a significant correlation for all analyzed languages.

Comparing our results with the diagnostic evaluation of Chinese Transformer-based models on the NLI task (Xu et al., 2020), we observe the following similar trends (see footnote h). Consistent with our findings, Common sense and Monotonicity appear to be quite challenging to learn. However, the results for low-level categories that fall under Predicate-Argument Structure might differ. While the Chinese LMs achieve an average accuracy score of 58% on this category, mBERT has a hard time dealing with Nominalization or Restrictivity but tends to learn Coordination scope, Prepositional phrases, and Genitives/Partitives. At the same time, predictions of mBERT weakly correlate with Double negation, but the Chinese models receive an average accuracy score of 60%. Similarly, Lexical semantics is one of the best-learned Chinese categories; however, the mBERT model does not demonstrate a consistent behavior on the corresponding low-level categories. A more detailed investigation of cross-lingual LMs on these typologically diverse languages may shed light on how the models learn linguistic properties crucial for the NLI task and provide more insights on the cross-lingual transfer of language-specific categories and markers (Hu et al., 2021).

The impact of random seeds. Our results are consistent with McCoy et al. (2020), who find that the instances of BERT fine-tuned on MNLI vary widely in their performance on the HANS dataset. In our work, the examination of the mBERT’s performance on the diagnostic datasets reveals a significant variance in the MCC scores and standard deviation w.r.t. random seeds for the majority of considered languages (see Section 5.2, Appendix 8.1). We observe significant standard deviations in the diagnostic performance, which indicates both per-language and per-category fine-tuning instability of the mBERT model. The findings highlight the importance of evaluating models on multiple restarts, as the scores obtained by a single model instance may not extrapolate to other instances, specifically in the multilingual benchmarks such as XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020b). Namely, the features that are crucial for diagnostic analysis of LMs might not be appropriately learned by a particular instance, which may underscore their generalization abilities on the canonical leaderboards or even question whether LMs are indeed capable of capturing them either from pretraining or fine-tuning data. The statements are supported by the probing analysis, which shows that fine-tuning of mBERT on the RTE tasks with varying random seeds may unpredictably affect the model’s knowledge (see Appendix 8.4). Specifically, the effect can be abstracted as twofold: the fine-tuned mBERT model either “forgets” a particular linguistic category or “acquires” uncertain knowledge, which is demonstrated by sharp increases and decreases in the probe performance over several languages (Singh et al., 2020).

The impact of dataset size and feature proportions. Prior studies have reported contradictory results about the effect of adding/augmenting training data on the linguistic generalization and inference capabilities of LMs. Some works demonstrate that counterfactually augmented data does not yield generalization improvements on the NLI task (Huang et al., 2020). However, most recent studies show that fine-tuning BERT on additional NLI samples that cover particular inference features improves their understanding while retaining or increasing the downstream performance on NLI benchmarks (Yanaka et al., 2020, 2019b; Richardson et al., 2020; Min et al., 2020; Hosseini et al., 2021). Besides, the proportion of the features in the training data can be crucial for the model performance (Yanaka et al., 2019a). One of the closely related works (Hu et al., 2021) tests cross-lingual transfer abilities of XLM-R (Conneau et al., 2020) on the NLI task for Chinese, exploring configurations of fine-tuning the model on combinations of Chinese and English data and evaluating it on diagnostic datasets.
Particularly, the model achieves the best performance when fine-tuned on concatenated OCNLI (Hu et al., 2020a) and English NLI datasets (e.g., Bowman et al., 2015; Williams et al., 2018; Nie et al., 2020) on the majority of covered diagnostic features, including uniquely Chinese ones: Idioms, Non-core argument, Pro-drop, Time of event, Anaphora, Argument structure, Comparatives, Double negation, Lexical semantics, and Negation. The results suggest that XLM-R can learn meaningful linguistic representations beyond surface properties and even strengthen the knowledge with the transfer from English, outperforming its monolingual counterparts.

Consistent with the latter studies, we find that extra data only in English provides better generalization capabilities of mBERT for all considered languages, which differ in their peculiarities of morphology and syntax. We also observe that using additional English data improves the fine-tuning stability, resulting in lower standard deviation values and higher Pearson’s correlation between the model instances’ scores (see Section 5.3). Another finding is that the number of training examples containing a particular feature might be critical for both diagnostic performance and fine-tuning stability of the mBERT model (see Section 5.4).

Limitations. The concept of benchmarking has become a standard paradigm for evaluating LMs against one another and human solvers, and dataset design protocols for the other languages are generally reproduced from English. However, there are still several methodological concerns, one of which is the dataset design and annotation choices (Rogers 2019; Dehghani et al., 2021). It should be noted that a relatively small number of dataset samples has a common basis in benchmarking due to expensive annotation or the need for expert competencies. Unlike datasets for machine-reading comprehension, such as MultiRC (Khashabi et al., 2018) and ReCoRD (Zhang et al., 2018), the GLUE-style datasets for learning choice of alternatives, logic, and causal relationships are often represented by a smaller number of manually collected and verified samples. They are by design sufficient for the human type of generalization but often pose a challenge for the tested LMs. The broad-coverage diagnostic dataset is standard practice for assessing linguistic generalization of LMs. Nevertheless, it contains 1104 samples, and certain features are represented by only 14 samples (Universal and Existential). These dataset design choices might not provide an opportunity for a fair comparison and reliable interpretation of LMs, which might be supported by bootstrap techniques or construction of evaluation sets balanced by the number of analyzed phenomena. Evaluating datasets for sufficiency for in-distribution and out-of-distribution generalization is another relevant challenge in the field. The solution might significantly help both in interpreting model learning outcomes and in designing better evaluation suites and benchmarks.
Recall that our results might not be transferable to other multilingual models, specifically different in the architecture design and pretraining objectives, for example, XLM-R, mBART (Liu et al., 2020), and mT5 (Xue et al., 2021).

7. Conclusion

This paper presents an extension of the ongoing research on the fine-tuning stability and consistency of linguistic generalization to the multilingual setting. We propose six GLUE-style textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. The datasets are constructed by translating the original datasets for English and Russian, with culture-specific phenomena localized and language phenomena adapted under linguistic expertise. We address the problem in the NLI task and analyze the linguistic competence of the mBERT model along with the impact of the random seed choice, training data size, and presence of linguistic categories in the training data. The method includes the standard SuperGLUE fine-tuning and evaluation procedure, and we ensure that the model is run with precisely the same hyperparameters but with different random seeds. The mBERT model demonstrates the per-category instability generally for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. However, related languages show similar performance in active and passive voice, conjunction, disjunction, prepositional phrases, and quantifiers. We also find that the generalization performance and fine-tuning stability can be improved for all languages by using additional data only in English, contributing to the cross-lingual transfer capabilities of multilingual LMs. However, the number of training samples containing a particular feature might also hurt all model instances’ performance. We leave a more detailed investigation of this behavior for future work. Another fruitful direction is analyzing a more diverse set of monolingual and multilingual LMs, varying by the architecture design and pretraining objectives. 
In general, our results are consistent with a growing body of related studies which explore aspects of learning inference properties from different perspectives, including findings for Chinese, a language typologically different from the considered ones in our work. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of LMs in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and their cross-lingual knowledge transfer abilities.

Funding statement

The work has been supported by the Ministry of Science and Higher Education of the Russian Federation within Agreement No 075-15-2020-793.

8. Appendix

8.1 Fine-tuning stability and random seeds

Table 6 presents the results of the per-category fine-tuning stability for each language.

Table 6. Results of the per-category fine-tuning stability for each language. The MCC scores are averaged over the total number of RS models. Average = The results averaged over five languages.

Figures 6 to 9 show the results of the diagnostic evaluation of the mBERT model fine-tuned with multiple random seeds on the corresponding RTE dataset for each language (see Section 5.2).

Figure 6. MCC scores on the French diagnostic dataset for mBERT fine-tuned with multiple random seeds.

Figure 7. MCC scores on the German diagnostic dataset for mBERT fine-tuned with multiple random seeds.

Figure 8. MCC scores on the Russian diagnostic dataset for mBERT fine-tuned with multiple random seeds.

Figure 9. MCC scores on the Swedish diagnostic dataset for mBERT fine-tuned with multiple random seeds.

8.2 Fine-tuning stability and dataset size

Figure 10 depicts the results of the language-wise diagnostic evaluation of mBERT when fine-tuned on combined RTE and MNLI training samples. Comparing the heat map with that of Figure 1 (see Section 5.2), we observe that MCC scores for some categories have greatly improved for all languages (Conjunction, Coordination scope, Core args, Genitives/Partitives, Prepositional phrases, and Universal), while logic categories negatively correlate with the model predictions (Disjunction, Downward monotone, and Intervals/Numbers). Figures 11 to 15 show seed-wise diagnostic evaluation of the mBERT model when fine-tuned on combined RTE and MNLI training datasets with multiple random seeds.

Figure 10. Language-wise diagnostic evaluation of mBERT when fine-tuned on combined RTE and MNLI training datasets.

Figure 11. mBERT’s seed-wise English diagnostic evaluation when fine-tuned on combined RTE and MNLI training datasets.

Figure 12. mBERT’s seed-wise French diagnostic evaluation when fine-tuned on combined RTE and MNLI training datasets.

Figure 13. mBERT’s seed-wise German diagnostic evaluation when fine-tuned on combined RTE and MNLI training datasets.

Figure 14. mBERT’s seed-wise Russian diagnostic evaluation when fine-tuned on combined RTE and MNLI training datasets.

Figure 15. mBERT’s seed-wise Swedish diagnostic evaluation when fine-tuned on combined RTE and MNLI training datasets.

8.3 Automatic annotation of diagnostic features

Table 7. Distribution of 15 diagnostic features in the RTE training datasets for English and Russian, and in the 10k MNLI subset according to the automatic annotation pipeline.

8.4 Coarse-grained probing analysis

A prominent methodology to explore the inner workings of pretrained LMs is to train a lightweight classifier over features produced by them to predict a linguistic property. During the probing procedure, the hidden representations produced by the model are taken from various layers of the transformer, and then a simple classifier is trained to predict a linguistic feature based on the given supervision (e.g., whether a particular category is present in a sentence or not). The underlying assumption is that if the classifier can predict the property, then the representations implicitly encode the linguistic knowledge.

Figure 16. Probing results for the category of Logic. X-axis is the layer number, while Y-axis refers to the classifier performance (accuracy score).

Figure 17. Probing results for the category of lexical semantics. X-axis is the layer number, while Y-axis refers to the classifier performance (accuracy score).

Figure 18. Probing results for the category of predicate-argument structure. X-axis is the layer number, while Y-axis refers to the classifier performance (accuracy score).

We apply the annotation procedure (see Section 5.4) to create a set of three binary classification tasks for English and Russian that correspond to the coarse-grained diagnostic categories of Logic, Lexical Semantics, and Predicate-Argument structure. The task is to identify if a particular category is present in a given pair of sentences. We follow the SentEval probing methodology (Conneau et al., 2018a) to train a linear classifier using cross-entropy loss, optimized with Adam (Kingma and Ba 2014). The classifier is trained on the corresponding annotated RTE dataset’s concatenated train and validation sets. We tune the L2-regularization parameter $\in [0.1, \ldots, 10^{-5}]$ on the RTE test set and evaluate performance on the diagnostic set using accuracy score. The input to the classifier is a concatenation of the mean-pooled intermediate representations of each sentence in a given pair. We probe a pretrained mBERT model as a reference, and six mBERT models fine-tuned on the RTE task with multiple random seeds $\in [0;\; 5]$ (see Section 5.2).
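The construction of the probe input can be illustrated with a small sketch. It assumes that the token vectors of one layer have already been extracted for each sentence and that padding positions are excluded; the function names and the two-dimensional toy vectors are hypothetical.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


def probe_input(premise_vectors, hypothesis_vectors):
    """Concatenate the mean-pooled premise and hypothesis representations."""
    return mean_pool(premise_vectors) + mean_pool(hypothesis_vectors)


# Toy layer outputs with hidden size 2 (invented values).
premise = [[1.0, 3.0], [3.0, 5.0]]
hypothesis = [[0.0, 2.0]]
features = probe_input(premise, hypothesis)
```

With mBERT’s 768-dimensional hidden states, the same construction would give the linear probe a 1536-dimensional input per sentence pair and layer.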

We now provide a brief description of the probing results. The overall pattern is that the probing trajectories across the models are more consistent for English than Russian. Specifically, the linguistic properties tend to be more localized in the lower layers than in the higher ones, meaning that the latter are more affected by the fine-tuning (Wu et al., 2020). Note that the lower and middle layers of the models for Russian are less similar, which is demonstrated by sharp increases and decreases in the probe performance. Besides, the fine-tuning effect differs across the tasks, for example, leading to better performance over the Lexical Semantics task for English, and vice versa for Russian (see Figure 17). This can be interpreted as follows: the fine-tuning unpredictably causes the model either to “forget” particular knowledge or to “acquire” knowledge of low certainty, shown over several RS models for both English and Russian. Despite the varying trajectories, the performance results remained similar for both languages, ranging from being close to or below random choice (see Figures 17 and 18) to becoming more confident in the lower layers on the Logic tasks (see Figure 16), with overall quality around 65% accuracy score.

Footnotes

d We refer to the RS model as the model instance fine-tuned with a specific random seed value.

e The percentage corresponds to the fraction of the heat map cell values for all languages that are higher/lower than a specified threshold for the corresponding metric. The thresholds are chosen empirically and can be adjusted depending on the strictness of the experimental setting.

g Our future work includes analysis of the other features, specifically for French, German, and Swedish.

h Note that the results are not directly comparable in terms of target metrics, dataset domains, and models.

References

Al-Shabab, O. (1996). Interpretation and the language of translation: creativity and conventions in translation.
Belinkov, Y., Poliak, A., Shieber, S., Van Durme, B. and Rush, A. (2019). Don’t take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 877–891.
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures.
Bentivogli, L., Clark, P., Dagan, I. and Giampiccolo, D. (2009). The fifth pascal recognizing textual entailment challenge. In TAC.
Bhojanapalli, S., Wilber, K., Veit, A., Rawat, A.S., Kim, S., Menon, A. and Kumar, S. (2021). On the reproducibility of neural network predictions. arXiv preprint arXiv:2102.03349.
Bowman, S.R., Angeli, G., Potts, C. and Manning, C.D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, pp. 632–642.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 8440–8451.
Conneau, A., Kruszewski, G., Lample, G., Barrault, L. and Baroni, M. (2018a). What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 2126–2136.
Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H. and Stoyanov, V. (2018b). XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, pp. 2475–2485.
Cui, L., Cheng, S., Wu, Y. and Zhang, Y. (2020). Does bert solve commonsense task via commonsense knowledge?
Dagan, I., Glickman, O. and Magnini, B. (2005). The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, pp. 177–190.
Dehghani, M., Tay, Y., Gritsenko, A.A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D. and Vinyals, O. (2021). The benchmark lottery.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186.
Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H. and Smith, N. (2020). Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.
Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34–48.
Giampiccolo, D., Magnini, B., Dagan, I. and Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Prague: Association for Computational Linguistics, pp. 1–9.
Glockner, M., Shwartz, V. and Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 650–655.
Goldberg, Y. (2019). Assessing BERT’s syntactic abilities.
Gorodkin, J. (2004). Comparing two k-category assignments by a k-category correlation coefficient. Computational Biology and Chemistry 28(5–6), 367–374.
Haim, R.B., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B. and Szpektor, I. (2006). The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
He, P., Liu, X., Gao, J. and Chen, W. (2021). Deberta: Decoding-enhanced bert with disentangled attention.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. and Meger, D. (2018). Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32.
Hossain, M.M., Kovatchev, V., Dutta, P., Kao, T., Wei, E. and Blanco, E. (2020). An analysis of natural language inference benchmarks through the lens of negation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 9106–9118.
Hosseini, A., Reddy, S., Bahdanau, D., Hjelm, R.D., Sordoni, A. and Courville, A. (2021). Understanding by understanding not: Modeling negation in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 1301–1312.
Hu, H., Richardson, K., Xu, L., Li, L., Kübler, S. and Moss, L.S. (2020a). Ocnli: Original chinese natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3512–3526.
Hu, H., Zhou, H., Tian, Z., Zhang, Y., Patterson, Y., Li, Y., Nie, Y. and Richardson, K. (2021). Investigating transfer learning in multilingual pre-trained language models through Chinese natural language inference. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, pp. 3770–3785.
Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O. and Johnson, M. (2020b). Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.
Hua, H., Li, X., Dou, D., Xu, C. and Luo, J. (2021). Noise stability regularization for improving BERT fine-tuning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 3229–3241.
Huang, W., Liu, H. and Bowman, S.R. (2020). Counterfactually-augmented SNLI training data does not yield better generalization than unaugmented data. In Proceedings of the First Workshop on Insights from Negative Results in NLP. Online: Association for Computational Linguistics, pp. 82–87.
Jawahar, G., Sagot, B. and Seddah, D. (2019). What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3651–3657.
Kassner, N. and Schütze, H. (2020). Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 7811–7818.
Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S. and Roth, D. (2018). Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 252–262.
Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Korobov, M. (2015). Morphological analyzer and generator for Russian and Ukrainian languages. In International Conference on Analysis of Images, Social Networks and Texts AIST 2015: Analysis of Images, Social Networks and Texts, vol. 542, pp. 320–332.
Kovaleva, O., Romanov, A., Rogers, A. and Rumshisky, A. (2019). Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 4365–4374.
Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L. and Schwab, D. (2020). FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, pp. 2479–2490.
Lee, C., Cho, K. and Kang, W. (2019). Mixout: Effective regularization to finetune large-scale pretrained language models. arXiv preprint arXiv:1909.11299.
Liang, Y., Duan, N., Gong, Y., Wu, N., Guo, F., Qi, W., Gong, M., Shou, L., Jiang, D., Cao, G., Fan, X., Zhang, R., Agrawal, R., Cui, E., Wei, S., Bharti, T., Qiao, Y., Chen, J.-H., Wu, W., Liu, S., Yang, F., Campos, D., Majumder, R. and Zhou, M. (2020). XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 6008–6018.
Liška, A., Kruszewski, G. and Baroni, M. (2018). Memorize or generalize? searching for a compositional rnn in a haystack. arXiv preprint arXiv:1802.06467.
Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M. and Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Madhyastha, P. and Jain, R. (2019). On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China: Association for Computational Linguistics, pp. 929–939.
Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R. and Zamparelli, R. (2014). A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 216–223.
McCoy, R.T., Frank, R. and Linzen, T. (2018). Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks.
McCoy, R.T., Min, J. and Linzen, T. (2020). BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 217–227.
McCoy, T., Pavlick, E. and Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3428–3448.
Merchant, A., Rahimtoroghi, E., Pavlick, E. and Tenney, I. (2020). What happens to BERT embeddings during fine-tuning? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 33–44.
Miaschi, A., Brunato, D., Dell’Orletta, F. and Venturi, G. (2020). Linguistic profiling of a neural language model. In Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, pp. 745–756.
Min, J., McCoy, R.T., Das, D., Pitler, E. and Linzen, T. (2020). Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 2339–2352.
Mosbach, M., Andriushchenko, M. and Klakow, D. (2020a). On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations.
Mosbach, M., Khokhlova, A., Hedderich, M.A. and Klakow, D. (2020b). On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 68–82.
Naik, A., Ravichander, A., Sadeh, N., Rose, C. and Neubig, G. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, pp. 2340–2353.
Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J. and Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 4885–4901.
Nisioi, S., Rabinovich, E., Dinu, L.P. and Wintner, S. (2016). A corpus of native, non-native and translated texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA), pp. 4197–4201.
Phang, J., Févry, T. and Bowman, S.R. (2018). Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.
Pruksachatkun, Y., Phang, J., Liu, H., Htut, P.M., Zhang, X., Pang, R.Y., Vania, C., Kann, K. and Bowman, S.R. (2020a). Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 5231–5247.
Pruksachatkun, Y., Yeres, P., Liu, H., Phang, J., Htut, P. M., Wang, A., Tenney, I. and Bowman, S.R. (2020b). jiant: A software toolkit for research on general-purpose text understanding models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Online: Association for Computational Linguistics, pp. 109–117.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer.
Richardson, K., Hu, H., Moss, L. and Sabharwal, A. (2020). Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 8713–8721.
Rogers, A. (2019). How the transformers broke nlp leaderboards.
Rogers, A. (2021). Changing the world by changing the data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 2182–2194.
Rogers, A., Kovaleva, O. and Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8, 842–866.
Rybak, P., Mroczkowski, R., Tracz, J. and Gawlik, I. (2020). KLEJ: Comprehensive benchmark for Polish language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 1191–1201.
Sanchez, I., Mitchell, J. and Riedel, S. (2018). Behavior analysis of NLI models: Uncovering the influence of three factors on robustness. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 1975–1985.
Shavrina, T., Fenogenova, A., Anton, E., Shevelev, D., Artemova, E., Malykh, V., Mikhailov, V., Tikhonova, M., Chertok, A. and Evlampiev, A. (2020). RussianSuperGLUE: A Russian language understanding evaluation benchmark. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 4717–4726.
Shavrina, T. and Shapovalova, O. (2017). To the methodology of corpus construction for machine learning: “taiga”. syntax tree corpus and parser. Corpus Linguistics 2017, p. 78.
Singh, J., Wallat, J. and Anand, A. (2020). BERTnesia: Investigating the capture and forgetting of knowledge in BERT. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 174–183.
Storks, S., Gao, Q. and Chai, J.Y. (2019). Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172.
Tanchip, C., Yu, L., Xu, A. and Xu, Y. (2020). Inferring symmetry in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, pp. 2877–2886.
Thawani, A., Pujara, J., Ilievski, F. and Szekely, P. (2021). Representing numbers in NLP: a survey and a vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 644–656.
Tsuchiya, M. (2018). Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Venhuizen, N.J., Hendriks, P., Crocker, M.W. and Brouwer, H. (2021). Distributional formal semantics. Information and Computation, p. 104763.
Wallace, E., Wang, Y., Li, S., Singh, S. and Gardner, M. (2019). Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 5307–5315.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3266–3280.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Brussels, Belgium: Association for Computational Linguistics, pp. 353–355.
Warstadt, A. and Bowman, S.R. (2019). Linguistic analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.
Weber, N., Shekhar, L. and Balasubramanian, N. (2018). The fine line between linguistic generalization and failure in Seq2Seq-attention models. In Proceedings of the Workshop on Generalization in the Age of Deep Learning. New Orleans, Louisiana: Association for Computational Linguistics, pp. 24–27.
Williams, A., Nangia, N. and Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 1112–1122.
Wu, J.M., Belinkov, Y., Sajjad, H., Durrani, N., Dalvi, F. and Glass, J. (2020). Similarity analysis of contextual word representation models. arXiv preprint arXiv:2005.01172.
Xu, L., Hu, H., Zhang, X., Li, L., Cao, C., Li, Y., Xu, Y., Sun, K., Yu, D., Yu, C., Tian, Y., Dong, Q., Liu, W., Shi, B., Cui, Y., Li, J., Zeng, J., Wang, R., Xie, W., Li, Y., Patterson, Y., Tian, Z., Zhang, Y., Zhou, H., Liu, S., Zhao, Z., Zhao, Q., Yue, C., Zhang, X., Yang, Z., Richardson, K. and Lan, Z. (2020). CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, pp. 4762–4772.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A. and Raffel, C. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 483–498.
Yanaka, H., Mineshima, K., Bekki, D. and Inui, K. (2020). Do neural models learn systematicity of monotonicity inference in natural language? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 6105–6117.
Yanaka, H., Mineshima, K., Bekki, D., Inui, K., Sekine, S., Abzianidze, L. and Bos, J. (2019a). Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics, pp. 31–40.
Yanaka, H., Mineshima, K., Bekki, D., Inui, K., Sekine, S., Abzianidze, L. and Bos, J. (2019b). HELP: A dataset for identifying shortcomings of neural models in monotonicity reasoning. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 250–255.
Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K. and Durme, B.V. (2018). Record: Bridging the gap between human and machine commonsense reading comprehension.
Zhang, Y., Warstadt, A., Li, X. and Bowman, S.R. (2021). When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 1112–1125.
Zhao, Y. and Bethard, S. (2020). How does BERT’s attention change when you fine-tune? an analysis methodology and a case study in negation scope. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 4729–4747.
Zhu, C., Cheng, Y., Gan, Z., Sun, S., Goldstein, T. and Liu, J. (2019). Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.
Zhuang, D., Zhang, X., Song, S.L. and Hooker, S. (2021). Randomness in neural network training: Characterizing the impact of tooling. arXiv preprint arXiv:2106.11872.
Table 1. Statistics of the NLI datasets. Vocab size refers to the total number of unique words. Num. of words stands for the average number of words in a sample. Fr = French; De = German; Sw = Swedish.

Table 2. The linguistic annotation of the diagnostic dataset.

Figure 1. Heat map of the mBERT’s language-wise evaluation on the diagnostic datasets. The brighter the color, the higher the MCC score.

Figure 2. MCC scores on the English diagnostic dataset for mBERT fine-tuned with multiple random seeds.

Table 3. Results of the fine-tuning stability experiments w.r.t. random seeds for each language. Overall MCC = overall MCC scores of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.
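The two summary statistics defined in this caption can be sketched in a few lines. The code below is an illustration only: it assumes that the scores being correlated are per-feature MCC scores (one list per RS model, in the same feature order), and the function names `overall_mcc` and `rs_corr` as well as the toy scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (norm_x * norm_y)


def overall_mcc(overall_scores: Sequence[float]) -> float:
    """Average the overall MCC scores over all RS models."""
    return mean(overall_scores)


def rs_corr(scores_per_seed: Sequence[Sequence[float]]) -> float:
    """Average Pearson's r over all unordered pairs of RS models."""
    pair_values = [pearson(a, b) for a, b in combinations(scores_per_seed, 2)]
    return mean(pair_values)


# Toy per-feature MCC scores for three RS models (illustrative, not from the paper).
toy_scores = [
    [0.1, 0.2, 0.3],
    [0.2, 0.4, 0.6],
    [0.3, 0.2, 0.1],
]
toy_rs_corr = rs_corr(toy_scores)
toy_overall = overall_mcc([0.2, 0.4])
```

With n RS models, the average runs over n(n-1)/2 seed pairs, so a single divergent model lowers the statistic through every pair it takes part in.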

Figure 3. Feature-wise heat maps of the performance patterns after fine-tuning on combined RTE and MNLI training datasets. Left: Delta between MCC scores (higher is better). Right: Delta between standard deviation values (lower is better).

Table 4. Results of the fine-tuning stability w.r.t using additional MNLI training samples in the cross-lingual transfer setting. Overall MCC = overall MCC scores of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.

Figure 4. Results of the fine-tuning stability w.r.t. the number of additional MNLI training samples added to the RTE training data for English and Russian. Overall MCC = overall MCC scores of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.

Figure 5. Distribution of the model MCC scores when fine-tuned on the combined data (RTE + 10k) as opposed to the standardized dataset size.

Table 5. Results of the fine-tuning stability w.r.t. varying degree of the feature distribution in the MNLI subset for English and Russian. Feature MCC = feature MCC score of each RS model averaged by the total number of RS models. RS corr. = pairwise Pearson’s correlation coefficients between the RS models’ MCC scores, averaged by the total number of random seed pairs.