1. Introduction
Word sense disambiguation (WSD), a demanding task in natural language processing (NLP), is the task of assigning the correct sense to an ambiguous word in a given text (Zhang et al. Reference Zhang, Zhang and Gao2023; Haouassi et al. Reference Haouassi, Bekhouche, Rahab, Mahdaoui and Chouhal2023). Humans resolve the meaning of words effortlessly when they communicate; machines, however, find it very difficult to interpret natural language data, and enabling machines to understand natural language makes it necessary to solve the WSD problem. WSD is not a new research problem in NLP but has a long history, and it has been classified as an AI-complete problem (Navigli Reference Navigli2009). Many NLP tasks, such as Machine Translation (Wang et al. Reference Wang, Li, Huang, Chen, Zhang and Zhou2023), Information Retrieval (Agirre and Edmonds Reference Agirre and Edmonds2007), Information Extraction (Abderrahim and Abderrahim Reference Abderrahim and Abderrahim2022), Question Answering (Rahman and Borah Reference Rahman and Borah2022), Sentiment Analysis (Kharate and Patil Reference Kharate and Patil2021), Text Summarization (Sert et al. Reference Sert, Elma and Altinel2023), and Discourse Analysis (Saeed et al. Reference Saeed, Nawab, Stevenson and Rayson2019), depend on it for better performance. When the WSD problem is handled carefully, the performance of these tasks improves; however, the lack of high-performance WSD systems hinders the creation of effective NLP systems for these tasks.
Although researchers have put considerable effort into solving this problem over the last few decades, these efforts cover only a few languages such as English, Catalan, Romanian, Italian, Spanish, Chinese, Swedish, Japanese, and Korean. Little attention has been given to other languages, in particular South Asian languages (Rouhizadeh et al. Reference Rouhizadeh, Shamsfard and Tajalli2022; Saeed et al. Reference Saeed, Nawab, Stevenson and Rayson2019). There is still a gap in addressing the WSD problem for low-resource languages like Kashmiri.
From a research perspective, the WSD task is investigated in two forms: the Lexical Sample WSD task and the All-Words WSD task. In the Lexical Sample task, a set of pre-selected ambiguous words is disambiguated, whereas the All-Words task aims to disambiguate all ambiguous words present in the input text. The Lexical Sample version has been explored more often than the All-Words version, as the latter is more challenging (Rouhizadeh et al. Reference Rouhizadeh, Shamsfard and Tajalli2022). The Kashmiri language, however, lacks research on both versions of the WSD task due to the unavailability of linguistic resources. The main objective of this research work is to develop the first standard sense-tagged dataset for the Kashmiri Lexical Sample WSD task, a prerequisite for applying data-driven approaches to the WSD problem.
The approaches developed so far to solve the WSD task fall into three categories: (1) knowledge-based approaches, (2) corpus-based approaches, and (3) hybrid approaches (Nodehi and Charkari Reference Nodehi and Charkari2022). Knowledge-based WSD approaches use the clues available in machine-readable dictionaries and similar resources to perform disambiguation. Corpus-based approaches need a large dataset to infer the exact sense of an ambiguous word from its context. Corpus-based approaches are of two types: (a) supervised and (b) unsupervised.
Supervised WSD approaches learn the WSD task from a sense-annotated corpus; unsupervised approaches, on the other hand, learn the disambiguation task from untagged corpora. Hybrid approaches to WSD take advantage of multiple resources such as WordNet (Miller Reference Miller1995) and tagged corpora. When the performance of these different approaches is analyzed, supervised approaches are observed to be superior (Mihalcea Reference Mihalcea2007; Park et al. Reference Park, Shin and Lee2022). The use of supervised WSD approaches for languages like English, French, Japanese, Dutch, etc. has shown steady progress. On the other hand, WSD research has lagged behind for low-resource languages due to the unavailability of sense-annotated datasets. In this research work, we introduce Kmri-WSD-Dataset,Footnote a a novel WSD dataset for the Kashmiri language. The dataset has a total of 19,854 manually sense-annotated instances for 124 ambiguous Kashmiri words.
Kashmiri, classified as an Indo-Aryan language, is one of the official languages recognized in the Indian constitution. Mainly spoken by residents of Jammu and Kashmir and the Kashmiri diaspora in various parts of the world, Kashmiri is known for its rich literary heritage. Key characteristics of the Kashmiri language include its complex phonological system, rich inflectional grammar, rich vocabulary, and dialect diversity (Koul and Wali Reference Koul and Wali2015). It is a highly inflected language with verb-second (V2) word order. From the computational resources perspective, Kashmiri is considered a low-resource language in comparison to other Indian languages like Hindi, Punjabi, Telugu, etc., and it has not been explored much by researchers in the NLP domain. Some of the important research works on computational resources for the Kashmiri language that are useful in NLP tasks are discussed here. A gold-standard raw corpus for Kashmiri has been developed at the Central Institute of Indian Languages (Ramamoorthy et al. Reference Ramamoorthy, Choudhary and Bhat2019). The corpus has 466,054 words obtained from different books, magazines, and newspapers and covers two domains, Aesthetics (85.93%) and Social Sciences (14.7%). A raw corpus of 2.3 million words was also created as part of the EMILLE project, supported by Lancaster University (Baker et al. Reference Baker, Hardie, McEnery, Cunningham and Gaizauskas2022); the purpose of the project was to develop a 67-million-word corpus for different Indic languages. In Mehdi and Kak (Reference Mehdi and Kak2018), a paradigm-based approach is used to develop a morphological analyzer for Kashmiri: the authors first identified the Part-of-Speech (PoS) of the lexical terms and then created different paradigms for different PoS categories for morphological analysis. Kak et al. (Reference Kak, Ahmad, Mehdi, Farooq and Hakim2017) present the challenges and issues related to the development of Kashmiri WordNet. Lawaye and Purkayastha (2014) proposed an automatic PoS tagger for Kashmiri using Conditional Random Fields (CRF). In Mehdi and Lawaye (Reference Mehdi and Lawaye.2011), the first Unicode font for Kashmiri, called “AFAN Koshur Naksh,” is designed. Kak et al. (Reference Kak, Mehdi and Lawaye2009) presented a tagset consisting of 26 tags for Kashmiri PoS tagging. Banday et al. (Reference Banday, Panzoo, Lone, Nazir, Malik, Rasoool and Maqsood2009) presented a trilingual (English-Hindi-Kashmiri) sense-based e-dictionary; the dictionary contains different senses of a word based on its context and hence can be used for resolving sense ambiguity. Recently, the Research Center for Technical Development of Punjabi language, Punjabi University, Patiala, decided to develop linguistic tools such as corpora, storage code converters, transliteration tools, typing tools, digital dictionaries, and word predictors for four languages, namely Kashmiri, Pashto, Balochi, and Dari. To the best of our knowledge, no standard Kashmiri sense-annotated corpus is freely available for research purposes. This is the primary motivation behind the research work presented in this paper, which extends Mir et al. (Reference Mir, Lawaye, Rana and Ahmed2023) by creating a sense-tagged corpus for the development of a supervised WSD system for the Kashmiri language.
Key contributions of the paper are:
-
Development of the first standard corpus for the Kashmiri Lexical Sample WSD task: To the best of our knowledge, this is the first standard corpus for the Kashmiri Lexical Sample WSD task. Its availability will aid the advancement of Kashmiri language research in the NLP domain and will assist scholars in developing comparable resources for other low-resource languages.
-
Setting baselines for future Kashmiri Lexical Sample WSD research: A collection of machine learning WSD algorithms is run on the developed corpus to provide baselines for evaluating future Kashmiri Lexical Sample WSD systems. An analysis of the results produced by these algorithms is also provided.
-
Utility of the developed corpus in assessing other Kashmiri NLP tasks: The developed corpus is tokenized, PoS-tagged, and sense-tagged. As a result, it can also serve as a benchmark for preprocessing tools such as tokenizers and PoS taggers.
The rest of the paper is organized as follows: Section 2 discusses the important sense-tagged datasets developed for different languages; Section 3 describes the procedure followed to develop the sense-tagged corpus; Section 4 discusses the methods used to explore the WSD task in Kashmiri; Section 5 presents the results and error analysis of the different methods used to carry out the WSD task; and the conclusion is presented in Section 6.
2. Related work
Sense-tagged datasets are commonly used in NLP tasks, particularly in WSD and related tasks like semantic role labeling and machine translation. These datasets are labeled with information about the specific sense or meaning of words in context. Research for developing sense-annotated corpora started a long time ago and a number of sense-annotated corpora have been developed for various natural languages. They are often used for training and evaluating WSD systems. The annotated corpora have been developed for handling both the Lexical Sample WSD task as well as the All-Words WSD task. Important WSD datasets developed for foreign and Indian languages are discussed below.
2.1 WSD datasets for foreign languages
English, being the most widely used language worldwide, has been the hotspot for NLP research, and WSD datasets of different sizes and characteristics have been developed for it. SemCor (Miller et al. Reference Miller, Chodorow, Landes, Leacock and Thomas1994) is the largest available manually curated English sense-tagged dataset aligned with WordNet senses. It has about 234,000 tokens annotated with syntactic category and the appropriate sense, and it has been used extensively for different NLP tasks; important research works using SemCor include Song et al. (Reference Song, Ong, Ng and Lin2021); Luan et al. (Reference Luan, Hauer, Mou and Kondrak2020); Nodehi and Charkari (Reference Nodehi and Charkari2022). The DSO corpus (Ng and Lee Reference Ng and Lee1996) is another WSD corpus, with 192,800 instances for a set of 191 target words, among which 121 are nouns and 72 are verbs. This corpus has also been employed by different researchers for evaluating WSD systems. Patankar and Devane (Reference Patankar and Devane2017) utilized the DSO corpus to analyze the WSD problem in depth and its role in resolving the multilingual translation problem. Itankar and Raza (Reference Itankar and Raza2020) also utilized the DSO corpus to analyze the role of WSD in machine translation by carrying out experiments on fifteen ambiguous words. MASC (Ide et al. Reference Ide, Baker, Fellbaum and Passonneau2010), the Manually Annotated Sub-Corpus, is a multilayered annotated dataset of 500,000 lexical items with named entity annotations, part-of-speech tags, syntactic structure, etc., making it applicable to multiple NLP tasks like WSD, syntactic parsing, and named entity recognition. Wijeratne et al. (Reference Wijeratne, Balasuriya, Sheth and Doran2017) utilized MASC for a WSD task for emoji sense discovery. de Lacalle and Agirre (Reference de Lacalle and Agirre2015) also used data from MASC to train the It Makes Sense (IMS) (Zhong and Ng Reference Zhong and Ng2010) WSD model and produced good results. In İlgen et al. (Reference İlgen, Adali and Tantuğ2012), a Lexical Sample dataset for Turkish is developed; it comprises noun and verb sets, each containing 15 highly ambiguous words, with at least 100 instances per word collected from various online sources. In Rakho et al. (Reference Rakho, Laporte and Constant2012), a WSD corpus is created for 20 polysemous French verbs and annotated using four different sense inventories, taking into account various types of evidence. The OMSTI dataset (Taghipour and Ng Reference Taghipour and Ng2015), short for One Million Sense Tagged Instances, is a huge sense-tagged dataset and a valuable asset for computational linguistics research. It consists of one million English words, each sense-tagged according to its meaning in context, and its diversity makes it useful for addressing the English WSD problem. Le et al. (Reference Le, Postma, Urbani and Vossen2018) utilized this dataset to carry out the WSD task using the LSTM (Hochreiter and Schmidhuber Reference Hochreiter and Schmidhuber1997) algorithm. Kokane and Babar (Reference Kokane and Babar2009) used the OMSTI dataset to carry out the supervised WSD task based on neural networks. Kokane et al. (Reference Kokane, Babar and Mahalle2023) utilized the OMSTI dataset along with two other datasets to carry out WSD experiments based on an adaptive WSD model.
The results showed that the proposed WSD model produced better results when utilizing the OMSTI dataset. The authors in Scarlini et al. (Reference Scarlini, Pasini and Navigli2020) also created large-scale multilingual WSD datasets for five languages to mitigate the knowledge-acquisition bottleneck faced in multilingual WSD. The released data contain about 15 million sense-annotated instances for five languages (English, German, Spanish, French, and Italian), along with five English datasets belonging to distinct semantic domains. The data are annotated automatically rather than manually and, when tested with supervised methods, showed better performance compared to other automatically created corpora. In Pasini et al. (Reference Pasini, Raganato and Navigli2021), XL-WSD, a cross-lingual evaluation framework for WSD covering sense-tagged datasets for 18 languages, is presented. The datasets are then used to carry out the WSD task with knowledge-based and neural network approaches, and the results produced by these approaches are quite encouraging. In Zheng et al. (Reference Zheng, Li, Dai, Chen, Liu, Sun and Liu2021), a Lexical Sample WSD dataset is developed for Chinese and used to enhance the performance of the Chinese WSD task by exploiting word-formation knowledge. Kirillovich et al. (Reference Kirillovich, Loukachevitch, Kulaev, Bolshina and Ilvovsky2022) created a sense-tagged Russian corpus. The corpus is created manually with raw data obtained from OpenCorporaFootnote b and senses from RuWordNet.Footnote c It contains a total of 6751 sentences with 109,893 lemmas and is used for testing an unsupervised WSD system. In Laba et al. (Reference Laba, Mudryi, Chaplynskyi, Romanyshyn and Dobosevych2023), a fine-tuned pretrained language model is presented for Ukrainian WSD, and a WSD evaluation dataset is created for this purpose. The dataset has 2882 samples, each containing a lemma and its possible meanings with examples for each meaning.
2.2 WSD datasets for Indian languages
In comparison to foreign languages, Indian languages have not been explored much in the NLP domain, and there are few standard linguistic resources available for them. Although researchers have made several attempts to tackle the WSD problem in Indian languages, the datasets used are usually small and private. Here we discuss some important WSD datasets available for Indian languages and important attempts to handle the WSD problem in these languages. WSD in Hindi, the national language of India, has been explored by different researchers. A Lexical Sample Hindi WSD dataset is discussed in Singh and Siddiqui (Reference Singh and Siddiqui2016). The dataset has only 7506 instances for 60 ambiguous words; instances were drawn from the Hindi CorpusFootnote d and the Internet, and senses from Hindi WordNet.Footnote e Based on this corpus, the Hindi WSD task has been analyzed in the research efforts presented in Singh and Siddiqui (Reference Singh and Siddiqui2015a), Singh and Siddiqui (Reference Singh and Siddiqui2015b); Singh et al. (Reference Singh, Siddiqui and Sharma2014). Singh and Siddiqui (Reference Singh and Siddiqui2015a) used two supervised algorithms, one based on the conditional probability of co-occurring words and the other the Naïve Bayes algorithm, to analyze the impact of karaka relations on Hindi WSD. Singh and Siddiqui (Reference Singh and Siddiqui2015b) compared the performance of three WSD algorithms trained on this dataset: the first is corpus-based Lesk (Reference Lesk1986), the second is based on conditional probability and co-occurring words, and the third is based on a classification information model. Singh et al. (Reference Singh, Siddiqui and Sharma2014) used this dataset to develop a Hindi WSD system based on the Naïve Bayes algorithm, exploring different feature combinations to check their impact on the performance of the WSD system. In Sarmah and Sarma (Reference Sarmah and Sarma2016), WSD for Assamese is carried out using a supervised methodology: a Naïve Bayes WSD model trained on a manually created WSD corpus of 2.7K sentences yielded 71% accuracy, and increasing the corpus to 3.5K sentences improved the results by 7%. The WSD corpus was created utilizing the Assamese corpus (Sarma et al. Reference Sarma, Bharali, Gogoi, Deka and Barman2012) and Assamese WordNet (Sarma et al. Reference Sarma, Medhi, Gogoi and Saikia2010). In Walia et al. (Reference Walia, Rana and Kansal2018), a supervised WSD methodology is adopted for the Punjabi language. The researchers created a WSD corpus for 100 ambiguous Punjabi words utilizing the Punjabi corpus obtained from ELRAFootnote f and Punjabi WordNet (Kaur et al. Reference Kaur, Sharma, Preet and Bhatia2010). The same dataset has been utilized by other researchers for Punjabi WSD (Walia et al. Reference Walia, Rana and Kansal2020). Singh and Kumar (Reference Singh and Kumar2020) also created a sense-tagged corpus and employed deep learning techniques for WSD; the corpus has 9251 instances for 66 ambiguous Punjabi words, with an average of 25.7 instances per sense. In Saeed et al. (Reference Saeed, Nawab, Stevenson and Rayson2019), a benchmark Lexical Sample WSD corpus, the ULS-WSD-18 Corpus, is developed for Urdu. The UrMono corpus (Jawaid et al. Reference Jawaid, Kamran and Bojar2014), the largest available raw Urdu corpus, is utilized to obtain instances, and Urdu Lughat (Board Reference Board2008) is used as the sense inventory. For each target word, there are 75 + 15n instances (where n is the number of senses of the target word) in the developed WSD corpus. The applicability of the developed dataset for building and evaluating Lexical Sample WSD systems for Urdu is then checked using different supervised WSD algorithms, viz. Naïve Bayes, SVM, ID3 (Quinlan Reference Quinlan1986), and k-NN. Different types of features, such as bag-of-words (BoW), most frequent sense, part-of-speech, and word embeddings, were used to train these models so as to analyze their role in WSD. In Sruthi et al. (Reference Sruthi, Kannan and Paul2022), an unsupervised LDA-based WSD methodology is applied to Malayalam; a WSD corpus with 1147 contexts for the target words is created to carry out experiments, which showed better performance when co-occurrence information and synonyms were used as features. In Das Dawn et al. (Reference Das Dawn, Khan, Shaikh and Pal2023), a Bengali sense-annotated corpus for the Lexical Sample WSD task is created. The dataset covers a total of one hundred ambiguous Bengali words, with three to four senses used in the annotation of each word and ten paragraphs for each sense of a particular word. The authors also propose a generic architecture for utilizing this dataset to analyze the WSD task in Bengali. In Patil et al. (Reference Patil, Bhavsar and Pawar2023), WSD for Marathi is investigated using BERT (Tenney et al. Reference Tenney, Das and Pavlick2019); the model is evaluated on a WSD dataset having 5285 instances for 282 ambiguous terms.
3. Dataset preparation process
This section is devoted to the systematic technique adopted to develop the sense-tagged dataset. In Subsection 3.1, data collection is discussed, Subsection 3.2 discusses the preprocessing used to clean raw corpus, Subsection 3.3 is devoted to the procedure followed to select the target words, Subsection 3.4 discusses the sense inventory selection procedure, Subsection 3.5 discusses the extraction of instances for the target words, Subsection 3.6 discusses the sense annotation process, and Subsection 3.7 discusses the encoding used for final corpus prepared. Various steps involved in the corpus development are depicted in the diagram below (Figure 1) and discussed in the following subsections.
3.1 Raw corpus
Due to the scarcity of digital data for the Kashmiri language, assembling a dataset for this research work was a major challenge. We explored different resources, both online and offline, to prepare the dataset. The PoS-tagged dataset produced by Lawaye and Purkayastha (Reference Lawaye. and Purkayastha2014), which contains about 120K tokens, was obtained. Other resources used for obtaining data include Kashmiri WordNet (Kak et al. Reference Kak, Ahmad, Mehdi, Farooq and Hakim2017) and the trilingual (English-Hindi-Kashmiri) E-dictionary (Banday et al. Reference Banday, Panzoo, Lone, Nazir, Malik, Rasoool and Maqsood2009). In addition, we created primary data wherever we found gaps. The corpus compiled from these resources is generic and contains text from various domains like history, sports, politics, religion, and news.
3.2 Data preprocessing
The raw corpus collected in the first step, written in the Perso-Arabic script and not yet PoS-tagged, was passed through a few preprocessing steps to make it suitable for further processing. The first step was tokenization and sentence splitting, as the text was noisy and sentences were not properly separated. The dataset was tokenized using the NLTK library and stored in a separate text file (8 MB) with one sentence per line. As the data were collected from multiple resources, and there is no standard writing system for Kashmiri, the same word often appears with different spellings in different resources. After analyzing the raw corpus carefully, we resolved all such cases by taking the spellings in Kashmiri WordNet as the standard. The PoS tagger developed by Lawaye and Purkayastha (2014), which has an accuracy of 94%, was then applied to the untagged data. The file containing the PoS-tagged dataset has one token per line along with its PoS tag. A sample of the PoS-tagged data is shown in Figure 2.
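A minimal sketch of this preprocessing step is given below. It assumes the raw text uses the Urdu-style full stop (۔) as the main sentence delimiter and that spelling variants are mapped to their Kashmiri WordNet forms through a small, manually built dictionary; the file names are placeholders, not the actual resources used.

```python
import re

# Assumed sentence delimiters for Perso-Arabic text (full stop, question mark, exclamation mark).
SENTENCE_DELIMS = r"[۔؟!]"
# Manually built mapping from spelling variants to the Kashmiri WordNet spelling.
SPELLING_MAP = {}

def preprocess(raw_text):
    sentences = [s.strip() for s in re.split(SENTENCE_DELIMS, raw_text) if s.strip()]
    normalised = []
    for sent in sentences:
        tokens = sent.split()                                # simple whitespace tokenization
        tokens = [SPELLING_MAP.get(t, t) for t in tokens]    # unify spellings
        normalised.append(" ".join(tokens))
    return normalised

if __name__ == "__main__":
    with open("kashmiri_raw.txt", encoding="utf-8") as fh:        # placeholder input file
        sentences = preprocess(fh.read())
    with open("kashmiri_sentences.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sentences))                           # one sentence per line
```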
3.3 Target word selection
The ultimate product of the dataset preparation process is the sense-tagged dataset for the Kashmiri Lexical Sample WSD task, and the first step toward this objective is selecting the target words. We extracted the list of ambiguous words present in Kashmiri WordNet and analyzed the PoS-tagged dataset for their presence, calculating the frequency of each ambiguous word in the dataset. From the ambiguous words present in the dataset, we selected the 124 words with the highest frequency for further processing. Besides frequency, we also checked whether a word takes different senses in a sufficient number of instances, so that the dataset is suitable for training a WSD model to predict the different senses the word can take in different contexts. For example, although the word (aamut) is ambiguous and frequent in the dataset, it was not chosen as a target word because it takes a single sense in most sentences out of the three senses available for it in Kashmiri WordNet. For the same reason, we dropped many other words from the target word list even though they are frequent in the dataset, including (zyad), (aagur), (dil), (aham), (shamil), (istimall), and (amal). Table 1 gives important information about the ambiguous words selected as target words.
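The frequency-based part of this filter can be sketched as follows, assuming a PoS-tagged file with one "token<TAB>tag" pair per line and a set of ambiguous lemmas taken from Kashmiri WordNet; apart from the cut-off of 124 words mentioned above, the names are illustrative.

```python
from collections import Counter

def load_tokens(pos_file):
    # PoS-tagged file: one "token<TAB>tag" pair per line.
    with open(pos_file, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield line.split("\t")[0]

def candidate_targets(pos_file, wordnet_ambiguous, top_n=124):
    # Count how often each WordNet-ambiguous word occurs in the tagged corpus
    # and keep the most frequent ones as candidate target words.
    counts = Counter(t for t in load_tokens(pos_file) if t in wordnet_ambiguous)
    return [word for word, _ in counts.most_common(top_n)]
```

Candidates returned this way were then inspected manually for sense diversity, as described above.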
3.4 Sense inventory
After finalizing the target word list, we needed the possible senses that these words can take in different contexts. To obtain senses, researchers have used resources such as machine-readable dictionaries, Wikipedia, and WordNet. In this research work, two resources were available to us: Kashmiri WordNet and the trilingual (English-Hindi-Kashmiri) E-Dictionary. We carefully analyzed the senses obtained from these resources to select the better option for sense inventory creation. It was observed that Kashmiri WordNet has a greater number of senses for many target words than the trilingual E-Dictionary: in Kashmiri WordNet the number of senses for the target words ranges from 2 to 33, which is higher than the range available in the trilingual E-Dictionary. For example, for the word (thud), whose lexical category is NOUN, Kashmiri WordNet gives 5 senses, as shown in Table 2 below.
In the trilingual (English-Hindi-Kashmiri) E-Dictionary, the word (Door) has 11 meanings. For the word (travun), Kashmiri WordNet has 33 meanings, whereas the trilingual E-Dictionary returns 20. There are 9 senses for the word (manun) in Kashmiri WordNet, whereas the trilingual E-Dictionary contains only 7 senses for this word. Kashmiri WordNet gives 13 senses for the word (seud), for which the trilingual E-Dictionary has only 7 senses. Likewise, there are 12 senses for the word (pakun) in Kashmiri WordNet, whereas the trilingual E-Dictionary has only 2. There are many other cases where Kashmiri WordNet gives more senses than the trilingual E-Dictionary; hence, we chose it for sense inventory creation and extracted the senses for the target words from it. Kashmiri WordNet organizes words in terms of concepts. Its fundamental unit is the synset, as in WordNet for English and other languages. Each word is associated with a synset and has a synset ID, lexical category, concept, and example sentence. Kashmiri WordNet can be accessed from the Indo WordNetFootnote g interface. Two important characteristics of the senses available in Kashmiri WordNet for the target words are that (a) the senses are very fine-grained and (b) it lists senses that are not commonly used. For our task, it is very difficult to obtain instances for fine-grained senses. To overcome this problem, we adopted the same procedure followed by other researchers, i.e., converting these fine-grained senses into coarse-grained senses. This transformation involves grouping fine-grained senses that are similar in meaning into a more general sense. This objective is achieved by following the sense compression technique proposed in Vial et al. (Reference Vial, Lecouteux and Schwab2019). The compression technique utilizes the different semantic relationships (hypernymy/hyponymy/synonymy/antonymy/meronymy/holonymy $\ldots$ ) available in Kashmiri WordNet. Synsets are divided into clusters C( $c_1$ , $c_2$ , $\ldots$ , $c_n$ ) based on these relationships, and the clusters are formed iteratively. Initially, each synset from the synset list S( $s_1$ , $s_2$ , $\ldots$ , $s_n$ ) is placed in a separate cluster, i.e., C = (( $s_1$ ), ( $s_2$ ), $\ldots$ , ( $s_n$ )). At each iteration, we sort C according to the sizes of the clusters and then analyze whether the smallest cluster $c_i$ is related to another cluster $c_j$ . If the two clusters share a common semantic relation, we merge $c_i$ and $c_j$ ; if there is no common semantic link between them, the merge is canceled and we take the next smallest cluster and repeat the same process. The algorithm stops when no further merge is possible. After converting the fine-grained senses into coarse-grained senses, we created the sense inventory, which contains each target word and the set of senses finalized for use in its annotation. Figure 3 shows a snapshot of the sense inventory.
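An illustrative sketch of this compression procedure is given below. The `related` function is a placeholder that should return True when two synsets share one of the WordNet relations listed above; the rest follows the iterative smallest-cluster-first merging just described.

```python
def compress_senses(synsets, related):
    """Group fine-grained synsets into coarse-grained clusters."""
    clusters = [{s} for s in synsets]          # start with one cluster per synset
    merged = True
    while merged:
        merged = False
        clusters.sort(key=len)                 # examine the smallest cluster first
        for i, ci in enumerate(clusters):
            partner = next(
                (j for j, cj in enumerate(clusters)
                 if j != i and any(related(a, b) for a in ci for b in cj)),
                None,
            )
            if partner is not None:
                clusters[partner] |= ci        # merge the two related clusters
                del clusters[i]
                merged = True
                break                          # re-sort and continue merging
    return clusters
```

Each resulting cluster is then treated as one coarse-grained sense in the inventory.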
3.5 Target word instances
The instances for the target words are extracted from the PoS-tagged dataset that we created after the preprocessing and PoS tagging discussed in Subsection 3.2. We obtained at least 75 + 15n sentences for each target word (where n is the number of senses of the target word), as per the guidelines set in the SensEval-2 Lexical Sample WSD task (Edmonds and Cotton Reference Edmonds and Cotton2001; Saeed et al. Reference Saeed, Nawab, Stevenson and Rayson2019). For example, the target word (taiz) has 8 senses used in annotation and therefore has ( $75+15\times 8$ ) = 195 instances in the final sense-tagged corpus.
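A small sketch of this check is shown below; the sense names and counts are placeholders, and only the 75 + 15n rule comes from the guideline cited above.

```python
def meets_minimum(instances_per_sense):
    # instances_per_sense: mapping sense_id -> number of collected instances
    n = len(instances_per_sense)
    total = sum(instances_per_sense.values())
    return total >= 75 + 15 * n

# A word with 8 senses needs at least 75 + 15*8 = 195 instances.
print(meets_minimum({f"sense_{i}": 25 for i in range(1, 9)}))   # 200 >= 195 -> True
```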
3.6 Sense annotation
The next phase in the dataset preparation process was the annotation itself, for which we employed manual sense annotation. Three annotators, all native speakers of Kashmiri, were given this task after being familiarized with it. To achieve a high-quality annotated corpus, we followed the steps used in Saeed et al. (Reference Saeed, Nawab, Stevenson and Rayson2019) and divided the overall annotation process into two phases. Two annotators were given the task of annotating the target word with the appropriate sense as per the context; they were provided with files containing the sentences for each target word and the senses to be used in annotation. Once the annotators finished, they discussed the annotation and resolved the cases where they had found difficulty or confusion. We then calculated the inter-annotator agreement and found it to be 92%, with a weighted Kappa score (McHugh Reference McHugh2012) of 0.83, indicating a high level of agreement. In the second phase of the annotation, a linguistic expert reviewed the annotated dataset and made corrections wherever necessary.
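The sketch below shows one way such agreement figures can be computed from the two annotators' label lists, using scikit-learn's Cohen's kappa with linear weights as a stand-in for the weighted kappa reported above; the label lists are placeholders.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder sense labels assigned by the two annotators to the same instances.
annotator_1 = ["sense_1", "sense_2", "sense_1", "sense_3"]
annotator_2 = ["sense_1", "sense_2", "sense_2", "sense_3"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
weighted_kappa = cohen_kappa_score(annotator_1, annotator_2, weights="linear")
print(f"agreement={raw_agreement:.2f}, weighted kappa={weighted_kappa:.2f}")
```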
3.7 XML encoding
The output of the preceding task is a sense-tagged corpus in which the target words are tagged with the appropriate sense. This corpus consists of simple text files, one for each target word. In the next step, we transformed the sense-tagged corpus into XML, the commonly used format for storing sense-tagged datasets, using the xml.etree.ElementTree module available in Python. The XML-encoded corpus has 124 files, one for each target word. Each file has <contextfile fileno='filenumber' filename='kmr_word'> as the root element, where the fileno attribute specifies the file number (1, 2, $\ldots$ , 124) and the filename attribute specifies the target word for which the file contains instances. Each sentence starts with a <sentence s_id='sentence_number'> element, which indicates the start of a new sentence; the s_id attribute gives the number assigned to the sentence within the file. Each word in the sentence is enclosed within a <wf> element pair, which has pos as an attribute specifying the PoS category of the word. For target words, <wf> carries sense_id as an additional attribute, specifying the unique number assigned to the sense that fits the target word in the given sentence. Figure 4 below shows an instance of the target word (thud) in the sense-annotated corpus.
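The sketch below builds one such file with xml.etree.ElementTree, following the element and attribute names described above; the tokens, PoS tags, and sense_id values are placeholders.

```python
import xml.etree.ElementTree as ET

# Root element: one <contextfile> per target word.
root = ET.Element("contextfile", fileno="1", filename="kmr_word")
sentence = ET.SubElement(root, "sentence", s_id="1")

# (token, pos, sense_id) triples; only the target word carries a sense_id.
for token, pos, sense_id in [("word1", "NOUN", None), ("kmr_word", "NOUN", "3")]:
    attributes = {"pos": pos}
    if sense_id is not None:
        attributes["sense_id"] = sense_id
    wf = ET.SubElement(sentence, "wf", attributes)
    wf.text = token

ET.ElementTree(root).write("kmr_word.xml", encoding="utf-8", xml_declaration=True)
```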
The sense-tagged dataset so created is finally divided into two partitions, a training set and a test set, for each target word. The split into training and test suites is governed by the following:
-
The total instances for a particular word are divided in an 80:20 ratio for training and testing.
-
To ensure a proper mix of instances for every sense of a target word in both suites, 20% of the instances of each sense are placed in the test suite and 80% in the training suite (a minimal split sketch is given after this list).
-
The training and test splits for each target word are stored in separate XML files.
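The per-sense 80:20 split can be sketched as follows; `instances` is assumed to map each sense_id to the list of sentences annotated with that sense for one target word, and the random seed is illustrative.

```python
import random

def split_instances(instances, test_fraction=0.2, seed=13):
    """Stratified 80:20 split: hold out test_fraction of the instances of each sense."""
    rng = random.Random(seed)
    train, test = [], []
    for sense_id, sentences in instances.items():
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * test_fraction))
        test.extend((sense_id, s) for s in shuffled[:cut])
        train.extend((sense_id, s) for s in shuffled[cut:])
    return train, test
```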
4. WSD experiments
After preparing the sense-tagged dataset, we used it to carry out experiments with different supervised machine learning methods to check its effectiveness for analyzing and assessing WSD models. To train the machine learning models, we used three distinct feature extraction techniques: most frequent sense (MFS) (Gale et al. Reference Gale, Church and Yarowsky1992), the traditional Bag of Words (BoW) model (Goldberg and Hirst Reference Goldberg and Hirst2017), and Word2Vec,Footnote h which are briefly discussed below. The machine learning algorithms trained on these features are then evaluated for their performance on Kashmiri WSD.
4.1 Most frequent sense
In natural languages, a polysemous word commonly takes one sense more frequently than the others; this most frequently occurring sense is termed the MFS. In WSD, the MFS-based approach is commonly used as a baseline against which the performance of other machine learning techniques on a given dataset is assessed (Saeed et al. Reference Saeed, Nawab, Stevenson and Rayson2019; Kumar Reference Kumar2020). In this research work, we compared the results produced by the other algorithms with the MFS-based method.
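A minimal sketch of this baseline is shown below: the sense observed most often in a word's training instances is predicted for every test instance. The label lists are placeholders.

```python
from collections import Counter

def mfs_baseline(train_labels, test_labels):
    """Predict the most frequent training sense for every test instance."""
    most_frequent = Counter(train_labels).most_common(1)[0][0]
    correct = sum(label == most_frequent for label in test_labels)
    return correct / len(test_labels)

print(mfs_baseline(["s1", "s1", "s2"], ["s1", "s2", "s1"]))   # 2/3 accuracy
```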
4.2 Bag of words (Count vectorizing) model
BoW, also known as count vectorizing, is the basic model used to transform raw text into a vector representation. In this approach, we count the occurrences of content words in the given data. The model converts the input into numeric form, in which the dimensions represent the words present in the input without taking syntax or semantics into account. A dimension may simply contain 0 or 1, representing the absence or presence of a word in the given corpus, or it may hold the frequency of that word. By default, implementations of this approach carry out preprocessing steps such as lowercasing, UTF-8 encoding, and ignoring single characters and punctuation, and they allow the user to customize tokenization, preprocessing, etc., so that the approach can be adapted to languages other than English. In this study, we adapted it to the Kashmiri language.
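A sketch of such a customized count vectorizer is given below, using scikit-learn's CountVectorizer as an illustrative implementation (the paper does not name a specific library): a plain whitespace tokenizer is supplied and lowercasing is disabled, since the defaults are tuned for Latin-script text. The example sentences are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder instances for one target word (one sentence per instance).
train_sentences = ["word1 word2 target word3", "word4 target word5"]
test_sentences = ["word1 target word5"]

vectorizer = CountVectorizer(tokenizer=str.split,   # whitespace tokenization for Perso-Arabic text
                             lowercase=False,       # the script has no case, so no lowercasing
                             token_pattern=None)    # disable the default Latin-oriented pattern
X_train = vectorizer.fit_transform(train_sentences)
X_test = vectorizer.transform(test_sentences)
print(X_train.shape, X_test.shape)
```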
4.3 Word2Vec model
Word2Vec is a deep learning approach used to obtain dense vector representations for a given input. The vector representations computed using Word2Vec have proven fruitful in different downstream NLP tasks. Word2Vec takes a large corpus as input, generates the vocabulary of all words in the corpus, and finally produces dense embeddings for each word in the vocabulary. As the vector representations produced by Word2Vec capture semantic and contextual information from the input very well, many researchers have used this approach to solve the WSD problem (Kumari and Lobiyal Reference Kumari and Lobiyal2022; Saeed et al. Reference Saeed, Nawab, Stevenson and Rayson2019). Word2Vec has two variants for producing vector representations: a) continuous bag-of-words (CBOW) and b) Skip-Gram. The CBOW variant uses the context words within a specific window to predict the middle (target) word; it assumes that the order of words in the context window is irrelevant to the prediction, hence the name. The Skip-Gram variant instead predicts the context words within a specific window for a given target word. In this study, we employed the Skip-Gram architecture to obtain word embeddings (WE) for all the target words. The model is trained on the whole corpus with the objective of maximizing the probability of predicting the context in which the target word occurs. Considering the word sequence ( $w_1$ , $w_2$ , $\ldots$ , $w_T$ ), the objective function can be written as (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013):
$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-n \le j \le n,\, j \ne 0} \log p\left(w_{t+j} \mid w_{t}\right)$$

where $n$ is the training context size. The basic formulation of Skip-Gram defines $p(w_{t+j} \mid w_t)$ using the softmax function:

$$p\left(w_{O} \mid w_{I}\right)=\frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W}\exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $v_w$ and $v'_w$ denote the input and output vector representations of $w$, and $W$ is the number of words in the vocabulary. We used the Gensim library to create the Word2Vec model. The trained model is used to obtain word embeddings for all the target words, with the dimension set to different values (100, 200, 300). Using these embeddings as features, we trained different machine learning algorithms and then tested them.
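A sketch of the Gensim training step is shown below (Gensim 4.x API); the corpus file name, window size, epoch count, and the transliterated target word are placeholders, and only the Skip-Gram setting (sg=1) and the dimension values follow the description above.

```python
from gensim.models import Word2Vec

# One tokenized sentence per line, produced by the preprocessing step (placeholder file name).
corpus = [line.split() for line in open("kashmiri_sentences.txt", encoding="utf-8")]

# sg=1 selects the Skip-Gram architecture; vector_size is varied over 100, 200, 300.
model = Word2Vec(sentences=corpus, vector_size=200, window=5, sg=1, min_count=2, epochs=10)

target = "taiz"                                   # transliterated placeholder target word
if target in model.wv:
    embedding = model.wv[target]                  # 200-dimensional feature vector
    print(model.wv.most_similar(target, topn=5))  # nearest neighbours in embedding space
```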
4.4 Classification algorithms used
Once the feature extraction process is over, the next step is to use these features to train machine learning algorithms. In this research work, we employed five classification algorithms, which are discussed briefly here:
4.4.1 Naive Bayes
Naïve Bayes WSD refers to a technique that uses the Naive Bayes algorithm to perform WSD. It calculates the probability of a word belonging to a particular sense given the context in which it appears. This is done by estimating the conditional probabilities of the features (e.g., surrounding words and syntactic patterns) given each sense and then applying Bayes’ theorem to obtain the posterior probability of each sense. In Naive Bayes WSD, the sense with the highest posterior probability is chosen as the disambiguated sense for the word in question.
4.4.2 Dl4jMlpClassifier
Dl4jMlpClassifier (Lang et al. Reference Lang, Bravo-Marquez, Beckham, Hall and Frank2019) is a classifier built on Deeplearning4j, a deep learning library for Java that provides tools and functionality for building and training deep neural networks. The Dl4jMlpClassifier algorithm is based on a multilayer perceptron (MLP) architecture, which is a type of feedforward neural network. MLPs consist of multiple layers of interconnected artificial neurons, where each neuron is a mathematical function that takes input from the previous layer and produces an output.
4.4.3 IBk
IBk algorithm (Reynolds et al. Reference Reynolds, Kontostathis and Edwards2011), also known as the k-nearest neighbors (k-NN) algorithm, is a simple and intuitive machine learning algorithm that classifies new instances based on their proximity to labeled instances in a feature space. The algorithm is trained on a set of features that capture the context in which the word appears. Given a new word instance to disambiguate, the algorithm identifies the k nearest neighbors in the feature space having the most similar feature representations to the new instance. The algorithm assigns the most frequent or highest-weighted sense from the neighbors as the disambiguated sense for the new word instance.
4.4.4 J48
J48 is a classification algorithm based on the C4.5 decision tree algorithm Quinlan (Reference Quinlan1993). The J48 algorithm constructs a decision tree based on the training data. The decision tree is a hierarchical structure of nodes and branches, where each node represents a feature test, and each branch corresponds to a specific outcome or value of the feature. Given a new word instance to disambiguate, the algorithm traverses the decision tree by evaluating the feature tests at each node. It follows the appropriate branch based on the feature values of the instance until reaching a leaf node. The leaf node reached by the instance determines the disambiguated sense. Each leaf node is associated with a specific word sense, and the majority or most frequently occurring sense among the instances falling into that leaf node is chosen as the disambiguated sense for the new word instance.
4.4.5 Support vector machines (SVM)
SVMs are binary classifiers that can be extended for multiclass classification, which is suitable for WSD where multiple word senses need to be distinguished Navigli (Reference Navigli2009). The SVM algorithm learns a hyperplane that separates the feature vectors of different word senses in the training data. The goal is to find a hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the nearest instances of each word sense. This hyperplane acts as the decision boundary for classification. Given a new word instance, the algorithm converts its features into a feature vector and applies the learned SVM model to classify the instance. The instance is assigned the word sense corresponding to the side of the decision boundary it falls into.
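As an illustration, the sketch below trains several of these classifiers on BoW features for one target word using scikit-learn stand-ins (MultinomialNB for Naïve Bayes, KNeighborsClassifier for IBk, DecisionTreeClassifier for J48, LinearSVC for SVM); the Weka implementations used in the paper will not give identical numbers, and the toy instances and labels are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder training/test instances and gold sense labels for one target word.
train_sentences = ["word1 word2 target word3", "word4 target word5", "word1 target word3"]
y_train = ["sense_1", "sense_2", "sense_1"]
test_sentences = ["word1 word2 target"]
y_test = ["sense_1"]

vectorizer = CountVectorizer(tokenizer=str.split, lowercase=False, token_pattern=None)
X_train = vectorizer.fit_transform(train_sentences)
X_test = vectorizer.transform(test_sentences)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "IBk (k-NN)": KNeighborsClassifier(n_neighbors=1),
    "J48 (decision tree)": DecisionTreeClassifier(),
    "SVM": LinearSVC(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))   # accuracy on the held-out instances
```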
4.5 Evaluation
Experiments are carried out on each target word separately using the classification algorithms discussed in the above section. To get a better overview of the system's performance and avoid complexity, we report the average of the results produced over all target words. To evaluate the WSD classifiers built on the created sense-annotated corpus, we used the standard evaluation measures commonly used in machine learning: accuracy, precision, recall, and F1-measure.
Accuracy (A): the proportion of correct sense assignments made by the system over the total number of instances.
Precision (P): the number of correct sense assignments divided by the total number of assignments made by the system.
Recall (R): the number of correct sense assignments divided by the total number of assignments expected from the system.
F1-measure: the harmonic mean of P and R, i.e., $F1 = \frac{2 \times P \times R}{P + R}$.
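The sketch below shows how these scores can be computed for one target word's test suite with scikit-learn; macro averaging over the senses is an assumption, and the label lists are placeholders.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Gold and predicted sense labels for the test suite of one target word (placeholders).
y_true = ["s1", "s2", "s1", "s3"]
y_pred = ["s1", "s2", "s2", "s3"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"A={accuracy:.2f}, P={precision:.2f}, R={recall:.2f}, F1={f1:.2f}")
```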
5. Results and discussions
The machine learning algorithms produced different results depending on the feature sets used. The algorithms were trained on the training suites and then evaluated on the test suites. The results produced by the various algorithms for each feature set are compared in Table 3.
WE-100, WE-200, and WE-300 indicate that the feature vector for the target word is obtained using word embeddings of dimension 100, 200, and 300, respectively.
From Table 3, it is clear that all the algorithms used showed significant improvement over the baseline approach (MFS). Performance varies with the feature set used to train the different machine learning WSD classifiers. The SVM-based WSD model performed better than the other WSD models, with accuracy = 71.84%, precision = 0.71, recall = 0.70, and F1-measure = 0.70 when trained on BoW features. The IBk-based WSD model trained on BoW features also performed well, giving results close to the SVM model with accuracy = 71.01%, precision = 0.70, recall = 0.69, and F1-measure = 0.69. The Dl4jMlpClassifier-based WSD model follows the IBk-based model, giving accuracy = 70.43%, precision = 0.68, recall = 0.69, and F1-measure = 0.68. The performance of the Naïve Bayes WSD model is lower than that of Dl4jMlpClassifier, with accuracy = 67.08%, precision = 0.66, recall = 0.64, and F1-measure = 0.65. The lowest results, with accuracy = 64.23%, precision = 0.63, recall = 0.64, and F1-measure = 0.63, are produced by the decision tree-based WSD model when used with BoW features.
When using Word2Vec-based features with dimensions 100, 200, and 300, all WSD classifiers showed lower performance than the BoW-based approach at dimension 100 (WE-100). However, with WE-200 and WE-300 the performance of the WSD classifiers improved and outperformed the BoW-based models.
As given in Table 3, among the different WSD algorithms used, the Dl4jMlpClassifier-based WSD model produced the best results, with accuracy = 73.74%, precision = 0.72, recall = 0.72, and F1-measure = 0.72, when trained with word embeddings of dimension 200, and the decision tree-based WSD model produced the lowest results, with accuracy = 66.76%, precision = 0.65, recall = 0.65, and F1-measure = 0.65, using the same settings. The results in Table 3 clearly show that WSD models trained with word embeddings perform better than WSD models trained using the BoW model in most cases. This leads us to the conclusion that word embeddings are better suited to WSD than the BoW model.
From the overall results presented in Table 3, it may be observed that, irrespective of the feature extraction technique used, the WSD model created using the decision tree algorithm does not perform at the level of the other WSD models. The main reason may be that the decision tree does not get enough information from the features to make the sense prediction. On the other hand, the SVM-based WSD model makes better predictions than the other WSD models when used with the BoW approach. The better performance of SVM may be attributed to its ability to handle data sparsity and its lower sensitivity to noise (Batuwita and Palade Reference Batuwita and Palade2013).
Table 4 shows the wrong sense predictions made by the Dl4jMlpClassifier for the target word (kaem). “Correct” specifies the sense assigned to the word (kaem) in the sense-annotated dataset, whereas “Predicted” specifies the sense assigned by the Dl4jMlpClassifier-based WSD classifier to the word (kaem) in the concerned instance. There are three senses present in the sense inventory for the word (kaem): (te ye karne eie), (palves wagaeres peth pane saet posh ya ther wagaeruk banavne aamut namoen), and (banavnech kaem). From Table 4, it is clear that the mistakes are confusions between Sense 2 and Sense 1, illustrating the behavior of the WSD algorithm.
5.1 Proposed WSD model error analysis
When the erroneous predictions produced by the different WSD models are analyzed, it is observed that the main reason behind the errors is the high number of senses exhibited by the ambiguous words. The distribution of instances over the different senses of an ambiguous word in the training data is nonuniform: the dataset has more sentences for the MFS of a target word, making it skewed toward the MFS. The results produced by the proposed WSD system highlight that ambiguous words with a smaller number of senses are disambiguated better than ambiguous words with a larger number of senses, i.e., the performance of the system diminishes as the number of senses increases. Figure 5 shows the degree to which the accuracy of the WSD model is impacted by the increase in the number of senses. The figure shows the average accuracy reported by the two best-performing WSD models: the SVM model, which performed best when the BoW model was used for feature extraction, and the Dl4jMlpClassifier model, which outperformed the other WSD models when word embeddings were used as features.
In Figure 5, the X-axis gives the number of senses for the target words which is constrained in this study between 2 and 8. The Y-axis on the left side gives the number of target words having a specific number of senses, whereas the Y-axis present on the right side gives the average accuracy returned by the classifier (SVM and Dl4jMlpClassifier). From the graph, the performance degradation with the increase in the number of senses of words is visible. This is in line with the research efforts presented in Saeed et al. (Reference Saeed, Nawab, Stevenson and Rayson2019) and Singh and Kumar (Reference Singh and Kumar2019).
It is also observed that the dataset contains sentences in which a particular ambiguous word occurs with similar context words but takes different interpretations. For example, the sentences (naves shahres manz raev tus weth te voet station) and (shahres manz raev tus weth te laejen cxoor karin) contain many similar content words, but the ambiguous word (weth) has a different interpretation in each. In such cases, it is difficult for the WSD model to make the right sense prediction.
Another important factor that has an influence on sense prediction is the presence of more than one ambiguous word in the same sentence. For example, in the sentence (ye weth che seud makans taam watan), words (weth) and (seud) are ambiguous, and it is difficult to predict the correct sense of both the words in such cases.
To collect the instances for the target words, we used Kashmiri WordNet in addition to other resources, and the sentences extracted from Kashmiri WordNet are very short in many cases. These short sentences do not provide enough contextual information for the WSD model to make a correct sense prediction. For example, the sentence (call khe cum) has only three words, and it is difficult for a WSD model to make the right sense prediction for the word (call) from such limited context.
6. Conclusion
The main objective of this research work was to develop a standard Lexical Sample WSD dataset for Kashmiri. The developed dataset contains 124 frequently used ambiguous Kashmiri words and 19,854 instances. The instances for the selected words were extracted from different resources, and Kashmiri WordNet was used to develop the sense inventory. In addition to developing the sense-tagged dataset, we used it to build WSD models based on different feature sets. Five machine learning algorithms (SVM, J48, IBk, Naïve Bayes, and Dl4jMlpClassifier) were used to build WSD classifiers, trained on the BoW model and on word embeddings created with the Word2Vec architecture. Analysis of the results produced by the different machine learning algorithms showed that the word-embedding approach performs better than the traditional BoW-based models. Among the different machine learning models, SVM showed the best performance when trained with BoW features, while Dl4jMlpClassifier outperformed its counterparts when used with word embeddings. Several observations emerged from this work: the performance of a WSD model is negatively impacted by the degree of ambiguity of a word; the model finds it difficult to predict the sense correctly when the target word appears in sentences with similar content words; and short sentences do not provide enough contextual information for sense prediction. In the future, the dataset will be enhanced by incorporating sense-tagged instances for more ambiguous terms so as to increase language coverage, and deep learning techniques will be used to carry out further experiments.