1. Introduction
The latest advances in neural architectures of language models (LMs) (Vaswani et al. 2017) have raised the importance of NLU benchmarks as a standardized practice for tracking progress in the field, with recent models exceeding conservative human baselines on some datasets (Raffel et al. 2020; He et al. 2021). Such LMs are centered around the “pre-train & fine-tune” paradigm, where a pretrained LM is directly fine-tuned for solving a downstream task. Despite the impressive empirical results, pretrained LMs struggle to learn linguistic phenomena from raw text corpora (Rogers 2021), even when the size of the pretraining data is increased (Zhang et al. 2021). Furthermore, the fine-tuning procedure can be unstable (Devlin et al. 2019) and raises doubts about whether it promotes task-specific linguistic reasoning (Kovaleva et al. 2019). The brittleness of standard fine-tuning approaches to various sources of randomness (e.g., weight initialization and training data order) can lead to different evaluation results and prediction confidences of models independently fine-tuned under the same experimental setup. Recent research has defined this problem as (in)stability (Dodge et al. 2020; Mosbach et al. 2020a), and it now serves as the subject of an interpretation direction aimed at exploring the consistency of linguistic generalization of LMs (McCoy et al. 2018, 2020).
Our paper is devoted to this problem in the task of natural language inference (NLI), which has been widely used to assess the language understanding capabilities of LMs in monolingual and multilingual benchmarks (Wang et al. 2018, 2019; Liang et al. 2020; Hu et al. 2020b). The task is framed as a binary classification problem, where the model should predict whether the meaning of the hypothesis is entailed by the premise. Many works show that NLI models learn shallow heuristics and spurious correlations in the training data (Naik et al. 2018; Glockner et al. 2018; Sanchez et al. 2018), stimulating a targeted evaluation of LMs on out-of-distribution sets covering inference phenomena of interest (Yanaka et al. 2019a, 2019b; McCoy et al. 2019; Tanchip et al. 2020). Although such datasets are extremely useful for analyzing how well LMs capture inference and abstract properties of language, English remains the focal point of the research, leaving other languages underexplored.
To this end, our work extends the ongoing research on fine-tuning stability and consistency of linguistic generalization to the multilingual setting, covering five Indo-European languages from four language groups: English (West Germanic), Russian (Balto-Slavic), French (Romance), German (West Germanic), and Swedish (North Germanic). Our contributions are twofold. First, we propose GLUE-style textual entailment and diagnostic datasets for French, Swedish, and German. Second, we explore the stability of linguistic generalization of mBERT across the five languages mentioned above, analyzing the impact of the random seed choice, training dataset size, and presence of linguistic categories in the training data. Our work differs from similar approaches described in Section 2 in that we (i) evaluate the inference abilities through the lens of broad-coverage diagnostics, which is often neglected for newly released LMs, typically compared with one another only by averaged scores on canonical benchmarks (Dehghani et al. 2021); and (ii) analyze the per-category stability of the model fine-tuning for the considered languages, testing mBERT’s cross-lingual transfer abilities.
2. Related work
NLI and diagnostic datasets. There is a wide variety of datasets constructed to facilitate the development of novel approaches to the problem of NLI (Storks et al. 2019). The task has evolved within a series of RTE challenges (Dagan et al. 2005) and now comprises several standardized benchmark datasets such as SICK (Marelli et al. 2014), SNLI (Bowman et al. 2015), MNLI (Williams et al. 2018), and XNLI (Conneau et al. 2018b). Despite the rapid progress, recent work has found that these benchmarks may contain biases and annotation artifacts, which raises questions about whether state-of-the-art models indeed possess or acquire inference abilities (Tsuchiya 2018; Belinkov et al. 2019). Various linguistic datasets have been proposed to challenge the models and help improve their performance on inference features (Glockner et al. 2018; Yanaka et al. 2019a, 2019b, 2020; McCoy et al. 2019; Richardson et al. 2020; Hossain et al. 2020; Tanchip et al. 2020). The MED (Yanaka et al. 2019a) and HELP (Yanaka et al. 2019b) datasets focus on aspects of monotonicity reasoning, motivating the follow-up work on the systematicity of this phenomenon (Yanaka et al. 2020). HANS (McCoy et al. 2019) aims at evaluating the generalization abilities of NLI models beyond memorizing lexical and syntactic heuristics in the training data. Similar in spirit, the concept of semantic fragments has been applied to synthesize datasets that target quantifiers, conditionals, monotonicity reasoning, and other features (Richardson et al. 2020). The SIS dataset (Tanchip et al. 2020) covers the symmetry of verb predicates and is designed to improve systematicity in neural models. Another feature studied in the field is negation, which has proved to be challenging not only for the NLI task (Hossain et al. 2020; Hosseini et al. 2021) but also for probing factual knowledge in masked LMs (Kassner and Schütze 2020).
Last but not least, broad-coverage diagnostics was introduced in the GLUE benchmark (Wang et al. 2018) and has since become a standard dataset for examining the linguistic knowledge of LMs on GLUE-style leaderboards. To the best of our knowledge, there are only two counterparts of the diagnostic dataset, for Chinese and Russian, introduced in the CLUE (Xu et al. 2020) and Russian SuperGLUE (Shavrina et al. 2020) benchmarks. Creating such datasets is not addressed in recently proposed GLUE-like benchmarks for Polish (Rybak et al. 2020) and French (Le et al. 2020).
Stability of neural models. A growing body of recent studies has explored the role of optimization, data, and implementation choices in the stability of training and fine-tuning neural models (Henderson et al. 2018; Madhyastha and Jain 2019; Dodge et al. 2020; Mosbach et al. 2020a). Bhojanapalli et al. (2021) and Zhuang et al. (2021) investigate the impact of weight initialization, mini-batch ordering, data augmentation, and hardware on the prediction disagreement between image classification models. In NLP, BERT has demonstrated instability when being fine-tuned on small datasets across multiple restarts (Devlin et al. 2019). This has motivated further research on the factors contributing most to such behavior, primarily the dataset size and the choice of random seed as a hyperparameter (Bengio 2012), which influences training data order and weight initialization. The studies report that changing only the random seed during the fine-tuning stage can cause a significant standard deviation in the validation performance, including on tasks from the GLUE benchmark (Lee et al. 2019; Dodge et al. 2020; Mosbach et al. 2020a; Hua et al. 2021). Another direction involves studying the effect of random seeds on model performance and robustness in terms of attention interpretation and gradient-based feature importance methods (Madhyastha and Jain 2019).
Linguistic competence of BERT. A plethora of works is devoted to the linguistic analysis of BERT and the inspection of how fine-tuning affects the model's knowledge (Rogers et al. 2020). The research has covered various linguistic phenomena, including syntactic properties (Warstadt and Bowman 2019), structural information (Jawahar et al. 2019), semantic knowledge (Goldberg 2019), common sense (Cui et al. 2020), and many others (Ettinger 2020). Contrary to the common understanding that BERT can capture language properties, some studies reveal that the model tends to lose this information after fine-tuning (Miaschi et al. 2020; Singh et al. 2020; Mosbach et al. 2020b) and fails to acquire task-specific linguistic reasoning (Kovaleva et al. 2019; Zhao and Bethard 2020; Merchant et al. 2020). Several works explore the consistency of linguistic generalization of neural models by independently training them from 50 to 5,000 times and evaluating their generalization performance (Weber et al. 2018; Liška et al. 2018; McCoy et al. 2018, 2020). In the spirit of these studies, we analyze the stability of the mBERT model w.r.t. diagnostic inference features, extending the experimental setup to the multilingual setting.
3. Multilingual datasets
This section describes textual entailment and diagnostic datasets for five Indo-European languages: English (West Germanic), Russian (Balto-Slavic), French (Romance), German (West Germanic), and Swedish (North Germanic). We use existing datasets for English (Wang et al. 2019) and Russian (Shavrina et al. 2020) and propose their counterparts for the other languages based on the GLUE-style methodology (Wang et al. 2018).
3.1 Recognizing textual entailment
The task of recognizing textual entailment is framed as a binary classification problem, where the model should predict whether the meaning of the hypothesis is entailed by the premise. We provide an example from the English RTE dataset below and report brief statistics for each language in Table 1.
• Premise: ‘Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.’
• Hypothesis: ‘Christopher Reeve had an accident.’
• Entailment: False.
English: RTE (Wang et al. 2018) is a collection of datasets from a series of competitions on recognizing textual entailment, constructed from news and Wikipedia (Dagan et al. 2005; Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009).
Russian: Textual Entailment Recognition for Russian (TERRa) (Shavrina et al. 2020) is an analog of the RTE dataset that consists of sentence pairs sampled from news and fiction segments of the Taiga corpus (Shavrina and Shapovalova 2017).
French, German, Swedish: Each sample from TERRa is manually translated and verified by professional translators, with the linguistic peculiarities preserved, culture-specific elements localized, and ambiguous samples filtered out. The resulting datasets contain fewer unique words than the ones constructed by filtering text sources (RTE and TERRa). We attribute this to the fact that translated texts may exhibit less lexical diversity and vocabulary richness (Al-Shabab 1996; Nisioi et al. 2016).
3.2 Broad-coverage diagnostics
Broad-coverage diagnostics (Wang et al. 2018) is an expert-constructed evaluation dataset that consists of 1104 NLI sentence pairs annotated with linguistic phenomena under four high-level categories (see Table 2). The dataset was originally included in the GLUE benchmark. It is used as an additional test set for examining the linguistic competence of LMs, which allows for revealing possible biases and conducting a systematic analysis of the model behavior.
As part of this study, LiDiRus (Linguistic Diagnostics for Russian), an equivalent diagnostic dataset for the Russian language, is created (Shavrina et al. 2020). The creation procedure includes a manual translation of the English diagnostic samples by expert linguists so that each indicated linguistic phenomenon and target label is preserved and culture-specific elements are localized. We apply the same procedure to construct diagnostic datasets for French, German, and Swedish by translating and localizing the English diagnostic samples. The label distribution in each dataset is 42/58% (Entailment: True/False). Consider an example of the NLI pair (Sentence 1: ‘John married Gary’; Sentence 2: ‘Gary married John’; Entailment: True) and its translation in each language:
• English: ‘John married Gary’ entails ‘Gary married John’;
• Russian: ‘’ entails ‘’;
• French: ‘John a épousé Gary’ entails ‘Gary a épousé John’;
• German: ‘John heiratete Gary’ entails ‘Gary heiratete John’;
• Swedish: ‘John gifte sig med Gary’ entails ‘Gary gifte sig med John’.
Linguistic challenges. Special attention is paid to the problems of the feature-wise translation of the examples. Since the considered languages are all Indo-European, relatively few translation challenges arise. For instance, all languages have morphological negation mechanisms, lexical semantics features, common sense, and world knowledge instances. The main distinctions are related to the category of the Predicate-Argument Structure. The strategy of case coding is exhibited differently across the languages, for example, in dative constructions. The dative was widely used in all ancient Indo-European languages and is still present in modern Russian, retaining numerous functions. In contrast, dative constructions are largely underrepresented in English and Swedish, and all the dative examples in the translations involve impersonal constructions with an indirect object instead of a subject. The same holds for genitives and partitives: since Swedish and English do not have case marking, genitive relations are indicated by standard noun phrase syntax. In French, “de + noun” constructions are used to indicate partitivity or genitivity. Below is an example of an English sentence and its corresponding translations into Swedish and French:
• English: ‘A formation of approximately 50 officers of the police of the City of Baltimore eventually placed themselves between the rioters and the militiamen, allowing the 6th Massachusetts to proceed to Camden Station.’;
• Swedish: ‘Om 50 poliser i staden Baltimore, i slutändan stod mellan demonstranterna och brottsbekämpande myndigheter, vilket gjorde det möjligt för 6:e Massachusetts Volunteer Regiment går till Cadman station.’;
• French: ‘Une cinquantaine de policiers de Baltimore se sont finalement interposés entre les manifestants et les forces de l’ordre, permettant au 6e régiment de volontaires du Massachusetts de se rendre à Cadman Station.’.
Translations for the Logic and Knowledge categories are obtained without difficulty; for example, all existential constructions share patterns with the translated analogs of quantifiers such as “some,” “many,” etc. However, we acknowledge that some low-level categories cannot be translated straightforwardly. For example, elliptical structures are, in general, quite different in Russian than in the other languages. Despite this, the translation-based method avoids the need for additional language-specific expert annotation.
4. Experimental setup
The experiments are conducted on the mBERT model, pretrained on concatenated monolingual Wikipedia corpora in 104 languages. We use the SuperGLUE framework under the jiant environment (Pruksachatkun et al. 2020b) to fine-tune the model multiple times for each language with a fixed set of hyperparameters while changing only the random seeds.
Fine-tuning. We follow the SuperGLUE fine-tuning and evaluation strategy with a set of default hyperparameters as follows. We fine-tune the mBERT model using a random seed $\in [0; 5]$, a batch size of 4, a learning rate of $1e^{-5}$, global gradient clipping, a dropout probability of $p=0.1$, and the AdamW optimizer (Loshchilov and Hutter 2017). The fine-tuning is performed on 4 Christofari Tesla V100 GPUs (32GB) for a maximum of 10 epochs with early stopping on the NLI validation data. The model is evaluated on the corresponding broad-coverage diagnostics dataset as described below.
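For illustration, a minimal sketch of this fine-tuning setup is given below. It is written with the Hugging Face Transformers library rather than the authors' jiant configuration (an assumption on our part); the hyperparameters mirror the ones stated above, and the early-stopping logic is omitted for brevity.

```python
# A minimal sketch of the fine-tuning loop described above (hypothetical code,
# not the authors' jiant configuration). "bert-base-multilingual-cased" is the
# standard mBERT checkpoint pretrained on Wikipedia in 104 languages.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

SEED = 0  # varied in [0; 5] across restarts
torch.manual_seed(SEED)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2, hidden_dropout_prob=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


def fine_tune(pairs, labels, epochs=10, batch_size=4):
    """pairs: list of (premise, hypothesis) strings; labels: list of 0/1 ints."""
    enc = tokenizer([p for p, _ in pairs], [h for _, h in pairs],
                    truncation=True, padding=True, return_tensors="pt")
    loader = DataLoader(
        TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)),
        batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):  # early stopping on the NLI validation data omitted
        for input_ids, attention_mask, y in loader:
            loss = model(input_ids=input_ids, attention_mask=attention_mask,
                         labels=y).loss
            loss.backward()
            # global gradient clipping, as stated in the setup
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
```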
Evaluation. Since the feature distribution and class ratio in the diagnostic set are not balanced, the model performance is evaluated with the Matthews correlation coefficient (MCC), the two-class variant of the $R_3$ metric (Gorodkin 2004):

$\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
MCC is computed between the array of model predictions and the array of gold labels (Entailment: True/False) for each low-level linguistic feature according to the annotation (Wang et al. 2019). The range of values is $[-1; 1]$ (higher is better).
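A short sketch of this per-feature evaluation is shown below, assuming the diagnostic set is stored as a pandas DataFrame with "label" and "prediction" columns and one boolean indicator column per low-level feature (a hypothetical layout, not the actual data format).

```python
# Sketch of the per-feature MCC evaluation; the DataFrame layout is assumed.
import pandas as pd
from sklearn.metrics import matthews_corrcoef


def feature_wise_mcc(diag: pd.DataFrame, feature_columns) -> pd.Series:
    scores = {}
    for feature in feature_columns:
        subset = diag[diag[feature]]  # samples annotated with this feature
        scores[feature] = matthews_corrcoef(subset["label"], subset["prediction"])
    return pd.Series(scores)

# The overall MCC reported in Section 5 is the average over per-feature scores:
# overall_mcc = feature_wise_mcc(diag, features).mean()
```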
Fine-tuning stability. Fine-tuning stability has multiple definitions in recent research. The majority of studies estimate the stability as the standard deviation of the validation performance, measured by accuracy, MCC, or F1-score (Phang et al. 2018; Lee et al. 2019; Dodge et al. 2020). Another possible notion is per-point stability, where a set of models is analyzed w.r.t. their predictions on the same evaluation sample (Mosbach et al. 2020a; McCoy et al. 2019). More recent works evaluate the stability with more granular measures, such as predictive divergence, the L2 norm of the trained weights, and the standard deviation of subgroup validation performance (Zhuang et al. 2021). This work analyzes the stability in terms of pairwise Pearson's correlation as follows. Given a fixed experimental setup, we compute the correlation coefficients between the MCC scores on the diagnostic datasets achieved by the models trained with different random seeds and average the coefficients by the total number of models (higher is better). Besides, we assess the per-category stability, that is, the standard deviation in the model performance w.r.t. random seeds for samples within a particular diagnostic category.
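The two stability measures could be computed roughly as in the sketch below; averaging the pairwise coefficients over the model pairs is our reading of the description above, and the function names are illustrative.

```python
# Sketch of the stability measures: average pairwise Pearson correlation between
# per-feature MCC vectors of models trained with different random seeds, and the
# per-category standard deviation of the MCC scores.
from itertools import combinations
import numpy as np


def seed_wise_stability(mcc_by_seed: np.ndarray):
    """mcc_by_seed: array of shape (n_seeds, n_features) with per-feature MCC scores."""
    # np.corrcoef computes Pearson's correlation between two score vectors
    corrs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(mcc_by_seed, 2)]
    rs_corr = float(np.mean(corrs))              # higher = more stable fine-tuning
    per_category_std = mcc_by_seed.std(axis=0)   # sigma reported per diagnostic category
    return rs_corr, per_category_std
```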
5. Testing the linguistic knowledge and fine-tuning stability
5.1 Language-wise diagnostics
We start by investigating how well the linguistic properties are learned given the standardized NLI dataset by fine-tuning the mBERT model on the corresponding train data for each language independently with the same hyperparameters and computing the overall MCC by averaging the MCC scores for each diagnostic feature. Figure 1 shows a language-wise heat map with the results, which we use as a “baseline” performance to analyze different experiment settings. Although the overall MCC scores differ only slightly from one another (e.g., German: $0.15$, English: $0.2$), there is variability in how the model outputs correlate with the linguistic features w.r.t. the languages. In order to measure this variability, we compute the pairwise Pearson's correlation between the overall MCC scores and average the coefficients over the total number of language pairs. The resulting Pearson's correlation is $0.3$, which indicates that the knowledge obtained during fine-tuning predominantly varies across the languages, and there is no general pattern in the model behavior. For instance, Conditionals contribute to correct predictions for English (MCC = $0.6$), contribute somewhat less for French (MCC = $0.27$), are neutral for German (MCC = $0.09$), and do not help to solve the task for Russian (MCC = $-0.31$) and Swedish (MCC = $-0.25$). On the other hand, some features receive similar MCC scores for specific languages, such as Active/Passive (English: MCC = $0.38$; French: MCC = $0.38$; Russian: MCC = $0.26$; Swedish: MCC = $0.24$), Anaphora/Coreference (French: MCC = $0.21$; German: MCC = $0.21$; Russian: MCC = $0.26$), Common sense (French: MCC = 0; German: MCC = 0; Swedish: MCC = 0), Datives (German: MCC = $0.34$; Russian: MCC = $0.38$; Swedish: MCC = $0.34$), Genitives/Partitives (English: MCC = 0; French: MCC = $0.036$; German: MCC = 0), and Symmetry/Collectivity (English: MCC = $-0.12$; French: MCC = $-0.17$; German: MCC = $-0.17$).
5.2 Fine-tuning stability and random seeds
We fine-tune the mBERT model multiple times while changing only the random seeds $\in [0;5]$ for each considered language as described in Section 4. Figure 2 shows the seed-wise results for English. The results for the other languages are presented in Appendix 8.1. The overall pattern is that the correlation between the fine-grained diagnostic features and model outputs varies w.r.t. the random seed. Namely, some features demonstrate a large variance in the MCC score over different random seeds, for example, Conditionals (English: MCC = $0.6$ [0]; MCC = $0.13$ [1, 4, 5]), Nominalization (English: MCC = $0.46$ [0]; MCC = $0.46$ [1, 3, 4, 5]), Datives (French: MCC = $0.64$ [4]; MCC = $0.76$ [5]; MCC = 0 [1, 3]), Non-monotone (French: MCC = 0 [0, 2]; MCC = $-0.58$ [4]; MCC = $0.21$ [5]), Genitives/Partitives (German: MCC = 0 [0, 1]; MCC = $0.56$ [2]; MCC = $-0.29$ [4]), Restrictivity (Russian: MCC = $0.12$ [0, 2, 5]; MCC = 0 [3, 4]; MCC = $-0.65$ [1]), and Redundancy (Swedish: MCC = $0.34$ [2]; MCC = 0 [3]; MCC = $0.8$ [5]). On the one hand, a number of features positively correlate with the model predictions regardless of the random seed, such as Core args, Intersectivity, Prepositional phrases, Datives (English); Active/Passive, Existential, Upward monotone (French); Anaphora/Coreference and Universal (German); Factivity and Redundancy (Russian); Symmetry/Collectivity and Upward monotone (Swedish). Some features, on the other hand, predominantly receive negative MCC scores: Disjunction and Intervals/Numbers (English), Symmetry/Collectivity (French and Russian), Coordination scope and Double negation (German), Conditionals and Temporal (Swedish). Table 3 aggregates the results of the seed-wise diagnostic evaluation for each language. While the overall MCC scores within each language differ only slightly, the mBERT model still correlates only weakly with the linguistic properties. Besides, the pairwise Pearson's correlation coefficients between the RS models (i.e., the models fine-tuned with different random seeds) vary across languages by up to $0.22$, which indicates that the fine-tuning stability of the mBERT model depends on the language.
Table 6 (see Appendix 8.1) presents granular results of the per-category fine-tuning stability of the mBERT model for each language. We now describe the categories that have received the least and the most significant standard deviations in the MCC scores over multiple random seeds. For most of the languages, the most stable categories are Common sense ($\sigma \in [0.04;\; 0.09]$) and Factivity ($\sigma \in [0.04;\; 0.1]$), while the most unstable ones are categories of Lexical Semantics, Logic, and Predicate-Argument Structure, for example, Genitives/Partitives ($\sigma \in [0.17;\; 0.31]$), Datives ($\sigma \in [0.12;\; 0.34]$), Restrictivity ($\sigma \in [0.04;\; 0.3]$), and Redundancy ($\sigma \in [0.16;\; 0.32]$). The variance in the performance indicates inconsistency of the linguistic generalization on a certain group of categories, both for the languages collectively and for each language individually.
5.3 Fine-tuning stability and dataset size
Recent and contemporaneous studies report that a small number of training samples leads to unstable fine-tuning of the BERT model (Devlin et al. 2019; Phang et al. 2018; Zhu et al. 2019; Pruksachatkun et al. 2020a; Dodge et al. 2020). To that end, we conduct two experiments to investigate how additional training data impacts the fine-tuning stability in the cross-lingual transfer setting and how it changes while the number of training samples gradually increases. We use the MNLI (Williams et al. 2018) dataset for English and collapse “neutral” and “contradiction” samples into the “not entailment” label to meet the format of the RTE task (Wang et al. 2019). The resulting 374k additional training samples are added to each language's corresponding RTE training data.
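The label collapsing can be illustrated with the Hugging Face datasets library, assuming its "multi_nli" release (where label 0 is entailment, 1 is neutral, and 2 is contradiction); the authors' actual preprocessing may differ.

```python
# Sketch of collapsing MNLI's three-way labels into the binary RTE scheme.
from datasets import load_dataset

mnli = load_dataset("multi_nli", split="train")


def to_rte_label(example):
    # entailment (0) keeps its label; neutral (1) and contradiction (2) collapse
    # into the single "not_entailment" class used by the RTE task
    return {"rte_label": "entailment" if example["label"] == 0 else "not_entailment"}


# adds an "rte_label" column; premise, hypothesis, and rte_label are then
# appended to each language's RTE training data
mnli_rte = mnli.map(to_rte_label)
```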
Does extra data in English improve stability for all languages? To analyze the performance patterns, we compute deltas between the feature-wise MCC scores and standard deviation values ($\sigma$) when using a single RTE training dataset (see Section 5.2) and a combination of the RTE and MNLI training datasets. Figure 3 shows heat maps of how the fine-tuning stability has changed after fine-tuning on the additional data. We find that the MCC scores have increased for 32% of categories among all languages on average (the delta between the MCC scores is more than $0.1$). The per-category fine-tuning stability has improved for 34% of categories among all languages on average (the delta between the $\sigma$ values is below $-0.05$). An interesting observation is that some categories receive confident performance improvements for all languages (the MCC delta is above $0.2$). Such categories include Conjunction, Coordination scope, Genitives/Partitives, Non-monotone, Prepositional phrases, Redundancy, and Relative clauses. However, the additional data does not help with learning the Disjunction and Downward monotone categories and even hurts the performance compared with the results in Section 5.2. We also find that 61% of categories for Russian have $\sigma$ deltas below $-0.05$, indicating that the per-category stability can be greatly improved by extending the training data with examples in English.
Table 4 presents the results of this setting with a comparison to the previous experiments, where the model is fine-tuned on the standardized train data size with multiple random seeds (see Sections 5.1 and 5.2). The overall trend is that extending the RTE training data with the MNLI samples helps to improve the fine-tuning stability for each language. Overall MCC scores for the diagnostic features have increased from $0.177$ to $0.263$ on average (up by 49%), and the average standard deviation decreased by $0.166$. Analyzing the impact on the fine-tuning stability w.r.t. the random seed (see Appendix 8.2), we observe that the variance in the MCC scores between the RS models has predominantly decreased for all languages. Moreover, the pairwise Pearson's correlation coefficients between the RS models have improved from $0.509$ to $0.837$ on average (up by 64%).
How many training samples are required for stability? To investigate the fine-tuning stability in the context of the training data size, we fine-tune the mBERT model as described in Section 4, while changing random seed $\in [0;\; 5]$ and gradually adding the MNLI samples $\in [1k, 5k, 10k, 50k, 100k, 200k, 250k, 374k]$ to the RTE training data for English and Russian. Figure 4 shows the results of this experiment. Despite the fact that the overall MCC scores stop increasing at the size of $RTE + 10k$ for both languages, the RS corr. is steadily improving, indicating a smaller variance in the MCC scores between the RS models. Besides, the model needs more data to improve the stability for Russian (recall that we add extra data in English).
5.4 Fine-tuning stability and presence of linguistic categories
We conduct the following experiment to investigate the relationship between the fine-tuning stability and the presence of particular diagnostic categories in the training data. We design a rule-based pipeline for annotating 15 out of 33 diagnostic features for English and Russian. Then, we evaluate the model depending on the percentage of samples containing these features in the corresponding RTE training dataset combined with 10k training samples from MNLI (this amount of extra data is selected based on the results in Section 5.3).
Description of annotation pipeline. Our study suggests that annotation of low-level diagnostic categories can be partially automated based on features expressed lexically or grammatically. Lexical Semantics can be detected by the presence of quantifiers, negation morphemes, factivity verbs, and proper nouns. Logic features can be expressed with the indicators of temporal relations (mostly prepositions, conjunctions, particles, and deictic words), negation, and conditionals. Features from the Predicate-Argument Structure category can be identified with pronouns and syntactic tags (e.g., Relative clauses, Datives, etc.). However, Knowledge categories cannot be obtained in this manner.
Such an approach relies only on the surface representation of the feature and is limited by the coverage of the predefined rules, thus leaving room for false-negative results. Keeping this in mind, we construct a set of linguistic heuristics to identify the presence of a particular feature based on the morphosyntactic and NER annotation with spaCy for English, and built-in dictionaries and morphological analysis with pymorphy2 for Russian (Korobov 2015). We also construct specific word lists for most of the features for both languages, for example, “all,” “some,” “every,” “any,” “anyone,” “everyone,” “nothing,” etc. (Quantifiers). The heuristics for the Russian language have several differences. For instance, dative constructions are detected by the morphological analysis of the nouns or pronouns, as the case is explicitly expressed in the inflection.
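The sketch below illustrates the kind of surface heuristics described above for English with spaCy; the word lists and rules are simplified stand-ins rather than the actual annotation pipeline.

```python
# Illustrative surface heuristics for a few diagnostic features (English only).
import spacy

nlp = spacy.load("en_core_web_sm")

# simplified stand-in for the quantifier word list mentioned in the text
QUANTIFIERS = {"all", "some", "every", "any", "anyone", "everyone", "nothing"}


def annotate(sentence: str) -> dict:
    doc = nlp(sentence)
    lemmas = {token.lemma_.lower() for token in doc}
    return {
        "Quantifiers": bool(lemmas & QUANTIFIERS),
        # negation markers ("not", "no", "never") carry the "neg" dependency label
        "Negation": any(token.dep_ == "neg" for token in doc),
        # English datives surface as indirect objects ("dative" in spaCy's scheme)
        "Datives": any(token.dep_ == "dative" for token in doc),
        "Named entities": len(doc.ents) > 0,
    }


print(annotate("Everyone gave John nothing."))
```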
Stability and category distribution. We use the pipeline to annotate each training sample from RTE, TERRa, and the MNLI 10k subset. Table 7 presents the feature distributions for the datasets (see Appendix 8.3). Figure 5 depicts the model performance trajectories when fine-tuned on the combined data as opposed to the standardized dataset size (see Section 5.1). The behavior is predominantly similar for both languages, and there is a strong correlation of $0.94$ between the MCC performance improvements. We select four features for further analysis: Conjunction (the MCC score improved for both languages), Anaphora/Coreference (there is a significant difference in the feature distribution between RTE and MNLI, and no such difference between TERRa and MNLI), Negation (the MCC score decreased for both languages, and the feature distribution differs between the languages), and Disjunction (the MCC score decreased for both languages). For each considered feature, we construct three controllable subsets with a varying percentage of feature presence in the training data (a sketch of this sampling is given below). We follow the same fine-tuning and evaluation strategy (see Section 4), changing the random seed $\in [0; 5]$ and the feature presence percentage $\in [25, 50, 75]$. Table 5 presents the results of the experiment. The general pattern observed for both languages is that adding more feature-specific training samples may actually hurt the fine-tuning stability along with the MCC score for the feature.
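A hypothetical sketch of constructing such a controllable subset with a target feature presence is shown below; the exact sampling procedure used in the experiments may differ.

```python
# Hypothetical sampling of a training subset with a target share of samples
# carrying a given diagnostic feature (our reading of the setup, not the
# authors' exact procedure).
import random


def controllable_subset(samples, feature, presence, size, seed=0):
    """samples: list of dicts with a boolean flag per feature; presence in {0.25, 0.5, 0.75}."""
    rng = random.Random(seed)
    with_feature = [s for s in samples if s[feature]]
    without_feature = [s for s in samples if not s[feature]]
    n_pos = int(size * presence)
    # assumes both pools are large enough for the requested sizes
    subset = rng.sample(with_feature, n_pos) + rng.sample(without_feature, size - n_pos)
    rng.shuffle(subset)
    return subset
```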
Feature MCC. The highest MCC scores for English are achieved when adding 50% (Conjunction, Negation) or 75% extra samples (Anaphora/Coreference, Disjunction). In contrast, this amount of data has decreased the MCC performance for Russian (Conjunction, Negation). Instead, a minimum of 25% additional samples is required to achieve the best MCC scores for the categories of Conjunction and Disjunction. Negation obtains an insignificant improvement when adding 75% samples, and Anaphora/Coreference reaches an MCC of $0.223$ with 50% extra data.
Fine-tuning stability. Despite the fact that the feature MCC scores may increase, the fine-tuning stability may decrease for the identical amounts of additional training samples, for example, Conjunction (English and Russian), Negation (Russian), Anaphora/Coreference (English), and Disjunction (English and Russian). The smallest variance between the RS models is predominantly observed at the 25% or 50% extra data sizes for both languages.
Probing analysis. To analyze this from another perspective, we apply the annotation pipeline to construct three probing tasks aimed at identifying the presence of the categories of Logic, Lexical Semantics, and Predicate-Argument structure. More details can be found in Appendix 8.4.
6. Discussion
Acquiring linguistic knowledge through NLI. A thorough language-wise analysis using the proposed multilingual datasets reveals how well the model learns the phenomena it is intended to learn for solving the NLI task. Despite the variability in the MCC performance, mBERT shows similar behavior on a number of features across languages that differ in their richness of morphology and syntax (see Section 5.1). Specifically, the model outputs are positively correlated with the following diagnostic categories that reflect the language peculiarities: Logic (Upward monotone, Conditionals, Existential, Universal, and Conjunction), Lexical semantics (Named entities), and Predicate-Argument structure (Ellipsis, Coordination scope, and Anaphora/Coreference). On the contrary, a number of features predominantly receive negative MCC scores: Logic (Disjunction, Downward monotone, and Intervals/Numbers) and Predicate-Argument structure (Restrictivity). The Logic features are reminiscent of the properties of formal semantics, which captures the meaning of linguistic expressions through their logical interpretation utilizing formal models (Venhuizen et al. 2021). Monotonicity (Upward/Downward monotone), as one of such features, covers various systematic patterns and allows for assessing inferential systematicity in natural languages. In line with Yanaka et al. (2019b), our results show that the model generally struggles to learn Downward monotone inferences with Disjunction for all languages. Another phenomenon to which mBERT is insensitive is the category of Negation. The model outputs weakly correlate with the true labels when the sample contains Negation, Double negation, and Morphological negation, indicating that the model fails to infer this core construction, which is a well-studied problem in the field (Naik et al. 2018; Ettinger 2020; Hosseini et al. 2021). Recently, Wallace et al. (2019) have shown that it is difficult for contextualized LMs to generalize beyond the numerical values seen during training, and various datasets and model improvements have been proposed to analyze and enhance the understanding of numeracy (Thawani et al. 2021). The results for the category Intervals/Numbers in the context of the NLI problem reveal that numerical reasoning does not correlate with the expected model behavior (German and Russian) and even confuses the model (English, French, and Swedish). We also find that the results for the category of Symmetry/Collectivity (Lexical Semantics) vary between the considered languages, achieving negative MCC scores for most of them (English, French, and German). We relate this to the fact that the model may overly rely on the knowledge about entities and relations between them, refined from the pretraining corpora, so that linguistic expressions of the features are ignored (Tanchip et al. 2020; Kassner and Schütze 2020). Last but not least, we find that broadly defined categories such as Common sense and World knowledge do not show a significant correlation for any of the analyzed languages.
Comparing our results with the diagnostic evaluation of Chinese Transformer-based models on the NLI task (Xu et al. 2020), we observe the following similar trends. Consistent with our findings, Common sense and Monotonicity appear to be quite challenging to learn. However, the results for low-level categories that fall under Predicate-Argument Structure might differ. While the Chinese LMs achieve an average accuracy score of 58% on this category, mBERT has a hard time dealing with Nominalization or Restrictivity but tends to learn Coordination scope, Prepositional phrases, and Genitives/Partitives. At the same time, predictions of mBERT weakly correlate with Double negation, whereas the Chinese models receive an average accuracy score of 60%. Similarly, Lexical semantics is one of the best-learned Chinese categories; however, the mBERT model does not demonstrate consistent behavior on the corresponding low-level categories. A more detailed investigation of cross-lingual LMs on these typologically diverse languages may shed light on how the models learn linguistic properties crucial for the NLI task and provide more insights on the cross-lingual transfer of language-specific categories and markers (Hu et al. 2021).
The impact of random seeds. Our results are consistent with McCoy et al. (2020), who find that instances of BERT fine-tuned on MNLI vary widely in their performance on the HANS dataset. In our work, the examination of mBERT's performance on the diagnostic datasets reveals a significant variance in the MCC scores and standard deviation w.r.t. random seeds for the majority of the considered languages (see Section 5.2, Appendix 8.1). We observe significant standard deviations in the diagnostic performance, which indicates both per-language and per-category fine-tuning instability of the mBERT model. The findings highlight the importance of evaluating models on multiple restarts, as the scores obtained by a single model instance may not extrapolate to other instances, specifically in multilingual benchmarks such as XGLUE (Liang et al. 2020) and XTREME (Hu et al. 2020b). Namely, the features that are crucial for the diagnostic analysis of LMs might not be appropriately learned by a particular instance, which may undermine their generalization abilities on the canonical leaderboards or even call into question whether LMs are indeed capable of capturing them either from pretraining or fine-tuning data. These statements are supported by the probing analysis, which shows that fine-tuning mBERT on the RTE tasks with varying random seeds may unpredictably affect the model's knowledge (see Appendix 8.4). Specifically, the effect can be abstracted as twofold: the fine-tuned mBERT model either “forgets” about a particular linguistic category or “acquires” uncertain knowledge, which is demonstrated by sharp increases and decreases in the probe performance over several languages (Singh et al. 2020).
The impact of dataset size and feature proportions. Prior studies have reported contradictory results about the effect of adding/augmenting training data on the linguistic generalization and inference capabilities of LMs. Some works demonstrate that counterfactually augmented data does not yield generalization improvements on the NLI task (Huang et al. 2020). However, most recent studies show that fine-tuning BERT on additional NLI samples that cover particular inference features improves their understanding while retaining or increasing the downstream performance on NLI benchmarks (Yanaka et al. 2019b, 2020; Richardson et al. 2020; Min et al. 2020; Hosseini et al. 2021). Besides, the proportion of the features in the training data can be crucial for the model performance (Yanaka et al. 2019a). One closely related work by Hu et al. (2021) tests the cross-lingual transfer abilities of XLM-R (Conneau et al. 2020) on the NLI task for Chinese, exploring configurations of fine-tuning the model on combinations of Chinese and English data and evaluating it on diagnostic datasets. In particular, the model achieves the best performance when fine-tuned on concatenated OCNLI (Hu et al. 2020a) and English NLI datasets (e.g., Bowman et al. 2015; Williams et al. 2018; Nie et al. 2020) on the majority of covered diagnostic features, including uniquely Chinese ones: Idioms, Non-core argument, Pro-drop, Time of event, Anaphora, Argument structure, Comparatives, Double negation, Lexical semantics, and Negation. The results suggest that XLM-R can learn meaningful linguistic representations beyond surface properties and even strengthen this knowledge with the transfer from English, outperforming its monolingual counterparts.
Consistent with the latter studies, we find that extra data only in English improves the generalization capabilities of mBERT for all considered languages, which differ in their peculiarities of morphology and syntax. We also observe that using additional English data improves the fine-tuning stability, resulting in lower standard deviation values and a higher Pearson's correlation between the model instances' scores (see Section 5.3). Another finding is that the number of training examples containing a particular feature might be critical for both the diagnostic performance and the fine-tuning stability of the mBERT model (see Section 5.4).
Limitations. The concept of benchmarking has become a standard paradigm for evaluating LMs against one another and human solvers, and dataset design protocols for other languages are generally reproduced from English. However, there are still several methodological concerns, one of which is the dataset design and annotation choices (Rogers 2019; Dehghani et al. 2021). It should be noted that a relatively small number of dataset samples is common in benchmarking due to expensive annotation or the need for expert competencies. Unlike datasets for machine-reading comprehension, such as MultiRC (Khashabi et al. 2018) and ReCoRD (Zhang et al. 2018), the GLUE-style datasets for learning choice of alternatives, logic, and causal relationships are often represented by a smaller number of manually collected and verified samples. They are by design sufficient for the human type of generalization but often pose a challenge for the tested LMs. The broad-coverage diagnostic dataset is standard practice for assessing the linguistic generalization of LMs. Nevertheless, it contains 1104 samples, and certain features (Universal and Existential) include only 14 samples. These dataset design choices might not provide an opportunity for a fair comparison and reliable interpretation of LMs, which might be mitigated by bootstrap techniques or by constructing evaluation sets balanced by the number of analyzed phenomena. Evaluating datasets for sufficiency for in-distribution and out-of-distribution generalization is another relevant challenge in the field. The solution might significantly help both in interpreting model learning outcomes and in designing better evaluation suites and benchmarks. Note that our results might not be transferable to other multilingual models, specifically those that differ in architecture design and pretraining objectives, for example, XLM-R, mBART (Liu et al. 2020), and mT5 (Xue et al. 2021).
7. Conclusion
This paper presents an extension of the ongoing research on fine-tuning stability and consistency of linguistic generalization to the multilingual setting. We propose six GLUE-style textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. The datasets are constructed by translating the original datasets for English and Russian, with culture-specific phenomena localized and language phenomena adapted under linguistic expertise. We address the problem in the NLI task and analyze the linguistic competence of the mBERT model along with the impact of the random seed choice, training data size, and presence of linguistic categories in the training data. The method includes the standard SuperGLUE fine-tuning and evaluation procedure, and we ensure that the model is run with precisely the same hyperparameters but with different random seeds. The mBERT model demonstrates per-category instability, generally for categories that involve lexical semantics, logic, and predicate-argument structure, and struggles to learn monotonicity, negation, numeracy, and symmetry. However, related languages show similar performance in active and passive voice, conjunction, disjunction, prepositional phrases, and quantifiers. We also find that the generalization performance and fine-tuning stability can be improved for all languages by using additional data only in English, contributing to the cross-lingual transfer capabilities of multilingual LMs. However, the number of training samples containing a particular feature might also hurt all model instances' performance. We leave a more detailed investigation of this behavior for future work. Another fruitful direction is analyzing a more diverse set of monolingual and multilingual LMs, varying in architecture design and pretraining objectives. In general, our results are consistent with a growing body of related studies which explore aspects of learning inference properties from different perspectives, including findings for Chinese, a language typologically different from the ones considered in our work. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of LMs in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and their cross-lingual knowledge transfer abilities.
Funding statement
The work has been supported by the Ministry of Science and Higher Education of the Russian Federation within Agreement No 075-15-2020-793.
8. Appendix
8.1 Fine-tuning stability and random seeds
Table 6 presents the results of the per-category fine-tuning stability for each language.
Figures 6–9 show the results of the diagnostic evaluation of the mBERT model fine-tuned with multiple random seeds on the corresponding RTE dataset for each language (see Section 5.2).
8.2 Fine-tuning stability and dataset size
Figure 10 depicts the results of the language-wise diagnostic evaluation of mBERT when fine-tuned on combined RTE and MNLI training samples. Comparing the heat map with that of Figure 1 (see Section 5.2), we observe that MCC scores for some categories have greatly improved for all languages (Conjunction, Coordination scope, Core args, Genitives/Partitives, Prepositional phrases, and Universal), while logic categories negatively correlate with the model predictions (Disjunction, Downward monotone, and Intervals/Numbers). Figures 11–15 show seed-wise diagnostic evaluation of the mBERT model when fine-tuned on combined RTE and MNLI training datasets with multiple random seeds.
8.3 Automatic annotation of diagnostic features
8.4 Coarse-grained probing analysis
A prominent methodology to explore the inner workings of pretrained LMs is to train a lightweight classifier over features produced by them to predict a linguistic property. During the probing procedure, the hidden representations produced by the model are taken from various layers of the transformer, and then a simple classifier is trained to predict a linguistic feature based on the given supervision (e.g., whether a particular category is present in a sentence or not). The underlying assumption is that if the classifier can predict the property, then the representations implicitly encode the linguistic knowledge.
We apply the annotation procedure (see Section 5.4) to create a set of three binary classification tasks for English and Russian that correspond to the coarse-grained diagnostic categories of Logic, Lexical Semantics, and Predicate-Argument structure. The task is to identify whether a particular category is present in a given pair of sentences. We follow the SentEval probing methodology (Conneau et al. 2018a) to train a linear classifier using cross-entropy loss, optimized with Adam (Kingma and Ba 2014). The classifier is trained on the concatenated train and validation sets of the corresponding annotated RTE dataset. We tune the L2-regularization parameter $\in [0.1, ..., 1{e}^{-5}]$ on the RTE test set and evaluate the performance on the diagnostic set using the accuracy score. The input to the classifier is a concatenation of the mean-pooled intermediate representations of each sentence in a given pair. We probe a pretrained mBERT model as a reference, and six mBERT models fine-tuned on the RTE task with multiple random seeds $\in [0;\; 5]$ (see Section 5.2).
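A minimal sketch of this probing setup is given below; for brevity, a scikit-learn logistic regression stands in for the SentEval-style linear classifier trained with Adam, and all names are illustrative.

```python
# Sketch of the probe features: mean-pooled hidden states of each sentence from
# a chosen mBERT layer, concatenated for the sentence pair.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                    output_hidden_states=True)


def pair_features(premise: str, hypothesis: str, layer: int) -> torch.Tensor:
    vectors = []
    for sentence in (premise, hypothesis):
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**enc).hidden_states[layer]  # (1, seq_len, 768)
        vectors.append(hidden.mean(dim=1).squeeze(0))      # mean pooling over tokens
    return torch.cat(vectors)                              # concatenated pair representation

# A linear probe is then fit on these features, e.g.:
# probe = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
# where y_train marks whether the coarse-grained category is present in the pair.
```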
We now provide a brief description of the probing results. The overall pattern is that the probing trajectories across the models are more consistent for English than for Russian. Specifically, the linguistic properties tend to be more localized in the lower layers than in the higher ones, meaning that the latter are more affected by the fine-tuning (Wu et al. 2020). Note that the lower and middle layers of the models for Russian are less similar, which is demonstrated by sharp increases and decreases in the probe performance. Besides, the fine-tuning effect differs across the tasks, for example, leading to better performance on the Lexical Semantics task for English, and vice versa for Russian (see Figure 17). This can be interpreted as follows: the fine-tuning unpredictably causes the model either to “forget” a particular piece of knowledge or to “acquire” knowledge of low certainty, as shown over several RS models for both English and Russian. Despite the varying trajectories, the performance results remain similar for both languages, ranging from close to or below random choice (see Figures 17 and 18) to more confident predictions in the lower layers on the Logic tasks (see Figure 16), with an overall quality of around 65% accuracy.