
On generalization of the sense retrofitting model

Published online by Cambridge University Press:  31 March 2023

Yang-Yin Lee*
Affiliation:
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
Ting-Yu Yen
Affiliation:
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
Hen-Hsen Huang
Affiliation:
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Yow-Ting Shiue
Affiliation:
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
Hsin-Hsi Chen
Affiliation:
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
*
*Corresponding author: E-mail: [email protected]

Abstract

With the aid of recently proposed word embedding algorithms, the study of semantic relatedness has progressed rapidly. However, word-level representations remain insufficient for many natural language processing tasks, and various sense-level embedding learning algorithms have been proposed to address this issue. In this paper, we present a generalized model derived from existing sense retrofitting models. The generalization takes into account the semantic relations between senses, the relation strength, and the semantic strength. Experimental results show that the generalized model outperforms previous approaches on four tasks: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. Based on the generalized sense retrofitting model, we also propose a standardization process on the dimensions with four settings, a neighbor expansion process from the nearest neighbors, and combinations of these two approaches. Finally, we propose a Procrustes analysis approach, inspired by bilingual mapping models, for learning representations outside the ontology. The experimental results show the advantages of these approaches on semantic relatedness tasks.

Type
Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press

1. Introduction

Models for the distributed representation of words (word embeddings) have drawn great interest in recent years because of their ability to acquire syntactic and semantic information from large unannotated corpora (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a; Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014; Sun et al. Reference Sun, Guo, Lan, Xu and Cheng2016). Likewise, more and more ontologies have been compiled with high-quality lexical knowledge, including WordNet (Miller Reference Miller1998), Roget’s 21st Century Thesaurus (Roget) (Kipfer Reference Kipfer1993), and the paraphrase database (PPDB) (Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015). Based on lexical knowledge, early linguistic approaches such as the Leacock Chodorow similarity measure (Leacock and Chodorow Reference Leacock and Chodorow1998), the Lin similarity measure (Lin Reference Lin1998), and the Wu–Palmer similarity measure (Wu and Palmer Reference Wu and Palmer1994) have been proposed to compute semantic similarity. Although these linguistic resource-based approaches are somewhat logical and interpretable, they do not scale easily (in terms of vocabulary size). Furthermore, approaches based on modern neural networks outperform most linguistic resource-based approaches with better linearity.

While the recently proposed contextualized word representation models (Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018; Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019; Radford et al. Reference Radford, Wu, Child, Luan, Amodei and Sutskever2019) can produce different representations for a target word depending on its context, evidence shows that contextualized word representations may perform worse than static word embeddings on some semantic relatedness datasets (Ethayarajh Reference Ethayarajh2019). Moreover, they may not be able to incorporate the knowledge in ontologies into the models. In contrast, some research has proposed models that combine word embeddings and lexical ontologies, using either joint training or post-processing (Yu and Dredze Reference Yu and Dredze2014; Faruqui et al. Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015). However, these word embedding models use only one vector to represent a word, which is problematic in natural language processing applications that require sense-level representation (e.g., word sense disambiguation and semantic relation identification). One way to account for such polysemy and homonymy is to introduce sense-level embeddings, via either pre-processing (Iacobacci, Pilehvar, and Navigli Reference Iacobacci, Pilehvar and Navigli2015) or post-processing (Jauhar, Dyer, and Hovy Reference Jauhar, Dyer and Hovy2015).

In this work, we focus on a post-processing sense retrofitting model, GenSense (Lee et al. Reference Lee, Yen, Huang, Shiue and Chen2018), which is a generalized sense embedding learning framework that retrofits a pre-trained word embedding (e.g., Word2Vec Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a, GloVe Pennington et al. Reference Pennington, Socher and Manning2014) with semantic relations between the senses, the relation strength, and the semantic strength.Footnote a GenSense, which generates low-dimensional sense embeddings, is inspired by the Retro sense model (Jauhar et al. Reference Jauhar, Dyer and Hovy2015) but has three major differences. First, it generalizes semantic relations from positive relations (e.g., synonyms, hyponyms, paraphrasing Lin and Pantel Reference Lin and Pantel2001; Dolan, Quirk, and Brockett Reference Dolan, Quirk and Brockett2004; Quirk, Brockett, and Dolan Reference Quirk, Brockett and Dolan2004; Ganitkevitch, Van Durme, and Callison-Burch Reference Ganitkevitch, Van Durme and Callison-Burch2013; Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015) to both positive and negative relations (e.g., antonyms). Second, each relation incorporates both semantic strength and relation strength. Within a semantic relation, there should be a weighting for each semantic strength. For example, although jewel has the synonyms gem and rock, the similarity between (jewel, gem) is clearly higher than that between (jewel, rock); thus, a good model should assign a higher weight to (jewel, gem). Last, GenSense assigns different relation strengths to different relations. For example, if the objective is to train a sense embedding that distinguishes between positive and negative senses, then the weight for negative relations (e.g., antonyms) should be higher, and vice versa. Experimental results suggest that relation strengths play an important role in balancing relations and are application dependent. Given an objective function that takes these three parts into consideration, sense vectors can be learned and updated using a belief propagation process on the relation-constrained network. A constraint on the update formula is also considered using a threshold criterion.

Apart from the GenSense framework, some work suggests using a standardization process to improve the quality of vanilla word embeddings (Lee et al. Reference Lee, Ke, Huang and Chen2016). Thus, we propose a standardization process on GenSense’s embedding dimensions with four settings, including (1) performing standardization after all of the iteration process (once); (2) performing standardization after every iteration (every time); (3) performing standardization before the sense retrofitting process (once); and (4) performing standardization before each iteration of the sense retrofitting process (every time). We also propose a sense neighbor expansion process from the nearest neighbors; this is added into the sense update formula to improve the quality of the sense embedding. Finally, we combine the standardization process and neighbor expansion process in four different ways: (1) GenSense with neighbor expansion, followed by standardization (once); (2) GenSense with neighbor expansion, followed by standardization in each iteration (each time); (3) standardization and then retrofitting of the sense vectors with neighbor expansion (once); and (4) in each iteration, standardization and then retrofitting of the sense vectors with neighbor expansion (each time).

Though GenSense can retrofit sense vectors connected within a given ontology, words outside of the ontology are not learned. To address this issue, we introduce a bilingual mapping method (Mikolov, Le, and Sutskever Reference Mikolov, Le and Sutskever2013b; Xing et al. Reference Xing, Wang, Liu and Lin2015; Artetxe, Labaka, and Agirre Reference Artetxe, Labaka and Agirre2016; Smith et al. Reference Smith, Turban, Hamblin and Hammerla2017; Joulin et al. Reference Joulin, Bojanowski, Mikolov, Jégou and Grave2018) that utilizes Procrustes analysis for learning the mapping between the original word embedding and the learned sense embedding. The goal of orthogonal Procrustes analysis is to find a transformation matrix W such that the representations before the sense retrofitting are mapped close to the representations after the sense retrofitting. After obtaining W, we can apply the transformation matrix to the senses that are not retrofitted.

In the experiments, we show that the GenSense model outperforms previous approaches on four types of datasets: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. In an experiment to evaluate the benefits yielded by the relation strength, we find an $87.7$ % performance difference between the worst and the best cases on the WordSim-353 semantic relatedness benchmark dataset (Finkelstein et al. Reference Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman and Ruppin2002). While the generalized model, which considers all the relations, performs well on the semantic relatedness tasks, we also find that antonym relations perform particularly well in the semantic difference experiment. We further find that the proposed standardization, neighbor expansion, and the combination of these two processes improve performance on the semantic relatedness experiment.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes our proposed generalized sense retrofitting model. In Section 4, we present the Procrustes analysis on GenSense. Section 5 describes the details of the experiments. The results and discussion are presented in Section 6. In Section 7, we discuss the limitations of the model and point out some future directions. Finally, we conclude this research in Section 8.

2. Related work

The study of word representations has a long history. Early approaches include utilizing the term-document occurrence matrix from a large corpus and then using dimension reduction techniques such as singular value decomposition (Deerwester et al. Reference Deerwester, Dumais, Furnas, Landauer and Harshman1990; Bullinaria and Levy Reference Bullinaria and Levy2007). Beyond that, recent unsupervised word embedding approaches (sometimes referred to as corpus-based approaches) based on neural networks (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a; Pennington et al. Reference Pennington, Socher and Manning2014; Dragoni and Petrucci Reference Dragoni and Petrucci2017) have performed well on syntactic and semantic tasks. Among these, Word2Vec word embeddings were released using the continuous bag-of-words (CBOW) model and the skip-gram model (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a). The CBOW model predicts the center word using contextual words, while the skip-gram model predicts contextual words using the center word. GloVe, another widely adopted word embedding model, is a log-bilinear regression model that mitigates the drawbacks of global factorization approaches (such as latent semantic analysis; Deerwester et al. Reference Deerwester, Dumais, Furnas, Landauer and Harshman1990) and local context window approaches (such as the skip-gram model) on the word analogy and semantic relatedness tasks (Pennington et al. Reference Pennington, Socher and Manning2014). The global vectors in GloVe are trained using unsupervised learning on aggregated global word–word co-occurrence statistics from a corpus; this encodes the relationships between words and yields vectorized word representations that capture ratios of co-occurrence probabilities. The objective functions of Word2Vec and GloVe are slightly different: Word2Vec utilizes negative sampling to make words that do not frequently co-occur more dissimilar, whereas GloVe uses a weighting function to adjust the word–word co-occurrence counts. To deal with the out-of-vocabulary (oov) issue in word representations, FastText (Bojanowski et al. Reference Bojanowski, Grave, Joulin and Mikolov2017) is a more advanced model which leverages subword information. For example, when considering the word asset with tri-grams, it is represented by the following character tri-grams: $\lt$ as, ass, sse, set, et $\gt$ . This technique not only resolves the oov issue but also better represents low-frequency words.

Apart from unsupervised word embedding learning models, there exist ontologies that contain lexical knowledge such as WordNet (Miller Reference Miller1998), Roget’s 21st Century Thesaurus (Kipfer Reference Kipfer1993), and PPDB (Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015). Although these ontologies are useful in some applications, different ontologies have different structures. In Roget, a synonym set contains all the words of the same sense and has its unique definition. For example, the word free in Roget has at least two adjective senses: (free, complimentary, comp, unrecompensed) and (free, available, clear). The definition of the first sense (without charge) is different from that of the second sense (not busy; unoccupied). The relevance of the synonyms can differ within each set: the ranking in the first sense of free is (free, complimentary) > (free, comp) > (free, unrecompensed). PPDB is an automatically created massive resource of paraphrases of three types: lexical, phrasal, and syntactic (Ganitkevitch et al. Reference Ganitkevitch, Van Durme and Callison-Burch2013; Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015). For each type, there are several sizes with different trade-offs in terms of precision and recall. Each pair of words is semantically equivalent to some degree. For example, (automobile, car, auto, wagon, …) is listed in the coarsest size, while the finest size contains only (automobile, car, auto).

As the development of these lexical ontologies and word embedding models has matured, many have attempted to combine them with either joint training (Bian, Gao, and Liu Reference Bian, Gao and Liu2014; Yu and Dredze Reference Yu and Dredze2014; Bollegala et al. Reference Bollegala, Alsuhaibani, Maehara and Kawarabayashi2016; Liu, Nie, and Sordoni Reference Liu, Nie and Sordoni2016; Mancini et al. Reference Mancini, Camacho-Collados, Iacobacci and Navigli2017) or post-processing (Faruqui et al. Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015; Ettinger, Resnik, and Carpuat Reference Ettinger, Resnik and Carpuat2016; Mrkšic et al. Reference Mrkšic, OSéaghdha, Thomson, Gašic, Rojas-Barahona, Su, Vandyke, Wen and Young2016; Lee et al. Reference Lee, Yen, Huang and Chen2017; Lengerich et al. Reference Lengerich, Maas and Potts2017; Glavaš and Vulić Reference Glavaš and Vulić2018). As the need for sense embeddings has become more apparent, some research has focused on learning sense-level embeddings with lexical ontologies.

Joint training for sense embedding utilizes information contained in the lexical database during the intermediate word embedding generation steps. For example, the SensEmbed model utilizes Babelfy to annotate the Wikipedia corpus and thereby generates sense-level representations (Iacobacci et al. Reference Iacobacci, Pilehvar and Navigli2015). NASARI uses WordNet and Wikipedia to generate word-based and synset-based representations and then linearly combines the two embeddings (Camacho-Collados, Pilehvar, and Navigli Reference Camacho-Collados, Pilehvar and Navigli2015). Mancini et al. (Reference Mancini, Camacho-Collados, Iacobacci and Navigli2017) proposed to learn word and sense embeddings in the same space via a joint neural architecture.

In contrast, this research focuses on the post-processing approach. In retrofitting on word embeddings, a new word embedding is learned by retrofitting (refining) the pre-trained word embedding with the lexical database’s information. One advantage of post-processing is that there is no need to train a word embedding from scratch, which often takes huge amounts of time and computation power. Faruqui et al. (Reference Faruqui, Dodge, Jauhar, Dyer, Hovy and Smith2015) proposed an objective function for retrofitting which minimizes the Euclidean distance between words in synonymic or hypernym–hyponym relations in WordNet, while at the same time preserving the original word embedding’s structure. Similar to retrofitting, a counter-fitting model was proposed to not only minimize the distance between vectors of words with synonym relations but also maximize the distance between vectors of words with antonym relations (Mrkšic et al. Reference Mrkšic, OSéaghdha, Thomson, Gašic, Rojas-Barahona, Su, Vandyke, Wen and Young2016). Their qualitative analysis shows that before counter-fitting, the closest words are related but not necessarily similar; after counter-fitting, the closest words are similar words.

For post-processing sense models, the Retro model (Jauhar et al. Reference Jauhar, Dyer and Hovy2015) applies graph smoothing with WordNet as a retrofitting step to tease the vectors of different senses apart. Li and Jurafsky (Reference Li and Jurafsky2015) proposed to learn sense embeddings through Chinese restaurant processes and showed a pipelined architecture for incorporating sense embeddings into language understanding. Ettinger et al. (Reference Ettinger, Resnik and Carpuat2016) proposed to use a parallel corpus to build a sense graph and then perform retrofitting on the constructed sense graph. Yen et al. (Reference Yen, Lee, Huang and Chen2018) proposed to learn sense embeddings through retrofitting on sense and contextual neighbors jointly; however, negative relations were not considered in their model. Remus and Biemann (Reference Remus and Biemann2018) used an unsupervised sense inventory to perform retrofitting on word embeddings to learn sense embeddings, though the quality of the unsupervised sense inventory is questionable. Although it has been shown that sense embeddings do not improve every natural language processing task (Li and Jurafsky Reference Li and Jurafsky2015), there is still a great need for sense embeddings for tasks that require sense-level representation (e.g., synonym selection, word similarity rating, and word sense induction) (Azzini et al. Reference Azzini, da Costa Pereira, Dragoni and Tettamanzi2011; Ettinger et al. Reference Ettinger, Resnik and Carpuat2016; Qiu, Tu, and Yu Reference Qiu, Tu and Yu2016). A survey on word and sense embeddings can be found in Camacho-Collados and Pilehvar (Reference Camacho-Collados and Pilehvar2018). After the proposal of the transformer model, researchers either utilize the transformer model directly (Wiedemann et al. Reference Wiedemann, Remus, Chawla and Biemann2019) or train it with ontologies to better capture the sense of a word in specific sentences (Shi et al. Reference Shi, Chen, Zhou and Chang2019; Loureiro and Jorge Reference Loureiro and Jorge2019; Scarlini, Pasini, and Navigli Reference Scarlini, Pasini and Navigli2020).

3. Generalized sense retrofitting model

3.1. The GenSense model

The GenSense model learns a better sense representation such that each new representation is close to its word form representation, its synonym neighbors, and its positive contextual neighbors, while actively pushing away from its antonym neighbors and its negative contextual neighbors (Lee et al. Reference Lee, Yen, Huang, Shiue and Chen2018). Let $V=\left\{w_1,...,w_n\right\}$ be the vocabulary of a trained word embedding and $|V|$ be its size. The matrix $\hat{Q}$ is the pre-trained collection of vector representations $\hat{Q}_i\in \mathbb{R}^d$ , where d is the dimensionality of a word vector. Each $w_i\in V$ is learned using a standard word embedding technique (e.g., GloVe Pennington et al. Reference Pennington, Socher and Manning2014 or word2vec Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a). Let $\Omega=\left(T,E\right)$ be an ontology that contains the semantic relationships, where $T=\left\{t_1,...,t_m\right\}$ is a set of senses and $|T|$ the total number of senses. In general, $|T|>|V|$ since one word may have more than one sense. For example, in WordNet the word gay has at least two senses $gay.n.01$ (the first noun sense; homosexual, homophile, homo, gay) and $gay.a.01$ (the first adjective sense; cheery, gay, sunny). Edge $\left(i,j\right)\in E$ indicates a semantic relationship of interest (e.g., synonym) between $t_i$ and $t_j$ . In our scenario, the edge set E consists of several disjoint subsets of interest; for example, the sets of synonym and antonym edges are disjoint because no pair of senses is related by both synonymy and antonymy at the same time. Thus $E=E_{r_1}\cup E_{r_2}\cup ... \cup E_{r_k}$ . If $r_1$ denotes the synonym relationship, then $\left(i,j\right)\in E_{r_1}$ if and only if $t_j$ is a synonym of $t_i$ . We use $\hat{q}_{t_j}$ to denote the word form vector of $t_j$ .Footnote b Then the goal is to learn a new matrix $S=\left(s_1,...,s_m\right)$ such that each new sense vector is close to its word form vertex and its synonym neighbors. The basic form of the objective of the sense retrofitting model, which considers only synonym relations, is

(1) \begin{equation}\sum_{i=1}^m\left[ \alpha_1 \beta_{ii}\|s_i-\hat{q}_{t_i}\|^2+\alpha_2 \sum_{\left(i,j\right)\in E_{r_1}}\beta_{ij}\|s_i-s_j\|^2\right]\end{equation}

where the $\alpha$ ’s balance the importance of the word form vertex and the synonym neighbors, and the $\beta$ ’s control the strength of the semantic relations. When $\alpha_1=0$ and $\alpha_2=1$ , the model considers only the synonym neighbors and may deviate too far from the original vector. From Equation (1), a learned sense vector approaches its synonyms while its distance from its original word form vector is constrained. In addition, this equation can be further generalized to consider all relations as

(2) \begin{equation}\sum_{i=1}^m\left[\alpha_1\beta_{ii}\|s_i-\hat{q}_{t_i}\|^2+\alpha_2\sum_{\left(i,j\right)\in E_{r_1}}\beta_{ij}\|s_i-s_j\|^2+\dots\right].\end{equation}

Apart from the positive sense relation, we now introduce three types of special relations. The first is the positive contextual neighbor relation $r_2$ . $\left(i,j\right)\in E_{r_2}$ if and only if $t_j$ is the synonym of $t_i$ and the surface form of $t_j$ has only one sense. In the model, we use the word form vector to represent the neighbors of the $t_i$ ’s in $E_{r_2}$ . These neighbors are viewed as positive contextual neighbors, as they are learned from the context of a corpus (e.g., Word2Vec trained on the Google News corpus) with positive meaning. The second is the negative sense relation $r_3$ . $\left(i,j\right)\in E_{r_3}$ if and only if $t_j$ is the antonym of $t_i$ . The negative senses are used in a subtractive fashion to push the sense away from the positive meaning. The last is the negative contextual neighbor relation $r_4$ . $\left(i,j\right)\in E_{r_4}$ if and only if $t_j$ is the antonym of $t_i$ and the surface form of $t_j$ has only one sense. As with the positive contextual neighbors, negative contextual neighbors are learned from the context of a corpus, but with negative meaning. Table 1 summarizes the aforementioned relations.

Table 1. Summary of the semantic relations. $sf(t_i)$ is the word surface form of sense $t_i$

In Figure 1, which contains an example of the relation network, the word gay may have two meanings: (1) bright and pleasant; promoting a feeling of cheer and (2) someone who is sexually attracted to persons of the same sex. If we focus on the first sense, then our model attracts $s_{gay_1}$ to its word form vector $\hat{q}_{gay_1}$ , its synonym $s_{glad_1}$ , and its positive contextual neighbor $\hat{q}_{jolly}$ . At the same time, it pushes $s_{gay_1}$ from its antonym $s_{sad_1}$ and its negative contextual neighbor $\hat{q}_{dull}$ .

Figure 1. An illustration of the relation network. Different node textures represent different roles (e.g., synonym and antonym) in the GenSense model.

Formalizing the above scenario and considering all its parts, Equation (2) becomes

(3) \begin{equation}\begin{aligned}&\sum_{i=1}^m\Biggl[\alpha_1\beta_{ii}\|s_i-\hat{q}_{t_i}\|^2+\alpha_2\sum_{\left(i,j\right)\in E_{r_1}}\beta_{ij}\|s_i-s_j\|^2+\alpha_3\sum_{\left(i,j\right)\in E_{r_2}}\beta_{ij}\|s_i-\hat{q}_j\|^2 \\[5pt] &+\alpha_4\sum_{\left(i,j\right)\in E_{r_3}}\beta_{ij}\|s_i+s_j\|^2+\alpha_5\sum_{\left(i,j\right)\in E_{r_4}}\beta_{ij}\|s_i+\hat{q}_j\|^2\Biggr].\end{aligned}\end{equation}

To minimize the above convex objective function, we apply an iterative updating method (Bengio, Delalleau, and Le Roux Reference Bengio, Delalleau and Le Roux2006). Initially, the sense vectors are set to their corresponding word form vectors (i.e., $s_i\leftarrow\hat{q}_{t_i}\forall i$ ). Then in the following iterations, the updating formula for $s_i$ is

(4) \begin{equation}s_i=\frac{\left[\begin{aligned}&-\alpha_5\sum_{j:\left(i,j\right)\in E_{r_4}}\beta_{ij}\hat{q}_j-\alpha_4\sum_{j:\left(i,j\right)\in E_{r_3}}\beta_{ij}s_j \\[5pt] &+\alpha_3\sum_{j:\left(i,j\right)\in E_{r_2}}\beta_{ij}\hat{q}_j+\alpha_2\sum_{j:\left(i,j\right)\in E_{r_1}}\beta_{ij}s_j+\alpha_1\beta_{ii}\hat{q}_{t_i}\end{aligned}\right]}{\left[\begin{aligned}&\alpha_5\sum_{j:\left(i,j\right)\in E_{r_4}}\beta_{ij}+\alpha_4\sum_{j:\left(i,j\right)\in E_{r_3}}\beta_{ij} \\[5pt] +&\alpha_3\sum_{j:\left(i,j\right)\in E_{r_2}}\beta_{ij}+\alpha_2\sum_{j:\left(i,j\right)\in E_{r_1}}\beta_{ij}+\alpha_1\beta_{ii}\end{aligned}\right]}\end{equation}

A formal description of the GenSense method is shown in Algorithm 1 (Lee et al. Reference Lee, Yen, Huang, Shiue and Chen2018), in which the $\beta$ parameters are retrieved from the ontology, and $\varepsilon$ is a threshold for deciding whether to update the sense vector or not, and as such is used as a stopping criterion when the difference between the new sense vector and the original sense vector is small. Empirically, 10 iterations are sufficient to minimize the objective function from a set of starting vectors to produce effective sense-retrofitted vectors. Based on the GenSense model, the next three subsections will describe three approaches to further improve the sense representations.
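To make the update concrete, the sketch below implements one possible reading of Equation (4) together with the threshold criterion; the data structures (adjacency lists of (neighbor, $\beta$) pairs keyed by relation name) and variable names are our own illustration, not the released GenSense code.

```python
import numpy as np

def gensense_update(S, Q_hat, word_form, rels, alphas, eps=0.1, iters=10):
    """Sketch of the GenSense update in Equation (4).

    S         : dict sense -> np.ndarray, initialized to the word form vectors
    Q_hat     : dict word  -> np.ndarray, pre-trained word embedding
    word_form : dict sense -> (surface word, beta_ii)
    rels      : dict relation -> dict sense -> list of (neighbor, beta);
                relations are 'syn', 'pos_ctx', 'ant', 'neg_ctx'
    alphas    : dict with keys 'self', 'syn', 'pos_ctx', 'ant', 'neg_ctx'
    """
    sign = {'syn': +1.0, 'pos_ctx': +1.0, 'ant': -1.0, 'neg_ctx': -1.0}
    use_word_vec = {'syn': False, 'pos_ctx': True, 'ant': False, 'neg_ctx': True}
    for _ in range(iters):
        for t, old in S.items():
            word, beta_ii = word_form[t]
            num = alphas['self'] * beta_ii * Q_hat[word]
            den = alphas['self'] * beta_ii
            for r in sign:
                for j, beta in rels.get(r, {}).get(t, []):
                    neighbor = Q_hat[j] if use_word_vec[r] else S[j]
                    num = num + sign[r] * alphas[r] * beta * neighbor
                    den += alphas[r] * beta
            new = num / den
            if np.linalg.norm(new - old) > eps:   # threshold criterion epsilon
                S[t] = new
    return S
```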

Algorithm 1. GenSense

3.2. Standardization on dimensions

Although the original GenSense model considers the semantic relations between the senses, the relation strength, and the semantic strength, the literature indicates that the vanilla word embedding model benefits from standardization on the dimensions (Lee et al. Reference Lee, Ke, Huang and Chen2016). In this approach, let $1\leq j\leq d$ be the d dimensions in the sense embedding. Then for every sense vector $s_i\in\mathbb{R}^d$ , the z-score is computed on each dimension as

(5) \begin{equation}s_{ij}^{\prime}=\frac{s_{ij}-\mu}{\sigma},\forall i,j\end{equation}

where $\mu$ is the mean and $\sigma$ is the standard deviation of dimension j. After this process, each sense vector is divided by its norm so that it has unit length:

(6) \begin{equation}s_i^{\prime\prime}=\frac{s_i^{\prime}}{\left\|s_i^{\prime}\right\|},\forall i\end{equation}

where $\left\|s_i^{\prime}\right\|$ is the norm of the sense vector $s_i^{\prime}$ . As this standardization process can be placed at multiple points, we consider the following four situations (a brief code sketch of the standardization step is given below):

(1) GenSense-Z: the standardization process is performed after every iteration.

(2) GenSense-Z-last: the standardization process is performed only at the end of the whole algorithm.

(3) Z-GenSense: the standardization process is performed at the beginning of each iteration.

(4) Z-first-GenSense: the standardization process is performed only once, before the first iteration.

The details of this approach are shown in Algorithms 2 and 3. Note that although further combinations or adjustments of these situations are possible, in the experiments we analyze only these four situations.
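A minimal sketch of the standardization step in Equations (5) and (6), assuming the sense vectors are stacked row-wise in a NumPy matrix; the small epsilon guard against division by zero is our addition.

```python
import numpy as np

def standardize(S, eps=1e-12):
    """Z-score each dimension across all senses (Equation (5)) and then
    L2-normalize every sense vector (Equation (6)).  S is an (m, d) matrix
    with one row per sense."""
    Z = (S - S.mean(axis=0)) / (S.std(axis=0) + eps)
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + eps)

# The four variants differ only in where this call is placed:
#   GenSense-Z        : after every iteration of the retrofitting loop
#   GenSense-Z-last   : once, after the final iteration
#   Z-GenSense        : at the beginning of every iteration
#   Z-first-GenSense  : once, before the first iteration
```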

Algorithm 2. Standardization Process

3.3. Neighbor expansion from nearest neighbors

In this approach, we utilize the nearest neighbors of the target sense vector to refine GenSense. Intuitively, if the sense vector $s_i$ ’s nearest neighbors are uniformly distributed around $s_i$ , then they may not be helpful. In contrast, if the neighbors are clustered in a distinct direction, then utilizing them is crucial. Figure 2 contains examples of nearest neighbors that may or may not be helpful. In Figure 2(a), cheap’s neighbors are not helpful since they are uniformly distributed, so their effects cancel out. In Figure 2(b), love’s neighbors are helpful since they are gathered in the same quadrant and pull the new sense vector of love closer to its related senses.

In practice, we pre-build a k-d tree over the sense embedding for rapid lookups of the nearest neighbors of vectors (Maneewongvatana and Mount Reference Maneewongvatana and Mount1999). After building the k-d tree and taking the nearest-neighbor term into consideration, the update formula for $s_i$ becomes

(7) \begin{equation}s_i=\frac{\left[\begin{aligned}&\alpha_6\sum_{j:\left(i,j\right)\in NN\left(s_i\right)}\beta_{ij}s_j&-\alpha_5\sum_{j:\left(i,j\right)\in E_{r_4}}\beta_{ij}\hat{q}_j&-\alpha_4\sum_{j:\left(i,j\right)\in E_{r_3}}\beta_{ij}s_j \\[5pt] +&\alpha_3\sum_{j:\left(i,j\right)\in E_{r_2}}\beta_{ij}\hat{q}_j&+\alpha_2\sum_{j:\left(i,j\right)\in E_{r_1}}\beta_{ij}s_j&+\alpha_1\beta_{ii}\hat{q}_{t_i}\end{aligned}\right]}{\left[\begin{aligned}&\alpha_6\sum_{j:\left(i,j\right)\in NN\left(s_i\right)}\beta_{ij}&+\alpha_5\sum_{j:\left(i,j\right)\in E_{r_4}}\beta_{ij}&+\alpha_4\sum_{j:\left(i,j\right)\in E_{r_3}}\beta_{ij} \\[5pt] +&\alpha_3\sum_{j:\left(i,j\right)\in E_{r_2}}\beta_{ij}&+\alpha_2\sum_{j:\left(i,j\right)\in E_{r_1}}\beta_{ij}&+\alpha_1\beta_{ii}\end{aligned}\right]}\end{equation}

where $NN\left(s_i\right)$ is the set of N nearest neighbors of $s_i$ and $\alpha_6$ is a newly added parameter for weighting the importance of the nearest neighbors. Details of the proposed neighbor expansion approach are shown in Algorithm 4. The main procedure of Algorithm 4 is similar to that of Algorithm 1 (GenSense) with two differences: (1) in line 4 we need to build the k-d tree and (2) in line 7 we need to compute the nearest neighbors for Equation (7).
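The nearest-neighbor lookup in line 4 of Algorithm 4 can be served by a k-d tree; the sketch below uses SciPy's cKDTree as one possible implementation (the library choice and function names are ours).

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbors(S, i, n_neighbors=10):
    """Return the indices of the N nearest senses of row i of the (m, d)
    sense matrix S, as needed for the NN(s_i) term in Equation (7)."""
    tree = cKDTree(S)                       # in practice, built once and reused
    _, idx = tree.query(S[i], k=n_neighbors + 1)
    return [j for j in np.atleast_1d(idx) if j != i][:n_neighbors]
```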

Algorithm 3. GenSense with Standardization Process

Figure 2. Nearest neighbors of cheap (a) and love (b). In (a), cheap’s neighbors are uniformly distributed and thus not helpful. In (b), love’s neighbors are gathered in quadrant I, and thus love can be attracted toward them.

Algorithm 4. GenSense-NN

3.4. Combination of standardization and neighbor expansion

With the standardization and neighbor expansion approaches, a straightforward and natural way to further improve the sense embedding’s quality is to combine these two approaches. In this study, we propose four combination situations:

(1) GenSense-NN-Z: in each iteration, GenSense is performed with neighbor expansion, after which the sense embedding is standardized.

(2) GenSense-NN-Z-last: in each iteration, GenSense is performed with neighbor expansion. The standardization process is performed only after the last iteration.

(3) GenSense-Z-NN: in each iteration, the sense embedding is standardized and then GenSense is performed with neighbor expansion.

(4) GenSense-Z-NN-first: standardization is performed only once, before the iteration process. After that, GenSense is performed with neighbor expansion.

As with standardization on the dimensions, although different combinations of the standardization and neighbor expansion approaches are possible, we analyze only these four situations in our experiments.

4. Procrustes analysis on GenSense

Although the GenSense model retrofits sense vectors connected within a given ontology, words outside of the ontology are not learned. We address this by introducing a bilingual mapping method (Mikolov et al. Reference Mikolov, Le and Sutskever2013b; Xing et al. Reference Xing, Wang, Liu and Lin2015; Artetxe et al. Reference Artetxe, Labaka and Agirre2016; Smith et al. Reference Smith, Turban, Hamblin and Hammerla2017; Joulin et al. Reference Joulin, Bojanowski, Mikolov, Jégou and Grave2018) for learning the mapping between the original word embedding and the learned sense embedding. Let $\left\{x_i,y_i\right\}_{i=1}^n$ be the pairs of corresponding representations before and after sense retrofitting, where $x_i\in\mathbb{R}^d$ is the representation before sense retrofitting and $y_i\in\mathbb{R}^d$ is that after sense retrofitting.Footnote c The goal of orthogonal Procrustes analysis is to find a transformation matrix W such that $Wx_i$ approximates $y_i$ :

(8) \begin{equation}\min_{W\in\mathbb{R}^{d\times d}}\frac{1}{n}\sum_{i=1}^{n}l\!\left(Wx_i,y_i\right), \text{ subject to } W^{T}W=I_d\end{equation}

where we select the typical square loss $l_2\!\left(x,y\right)=\left\|x-y\right\|_2^2$ as the loss function. When constraining W to be orthogonal (i.e., $W^{T}W=I_d$ , where $I_d$ is the d-dimensional identity matrix; the dimensionality of the representations before and after retrofitting is the same), selecting the square loss function makes formula (8) a least-squares problem, solvable with a closed-form solution. From formula (8),

(9) \begin{equation}\begin{aligned}&\mathop{{\textrm{arg min}}}_{W\in\mathbb{R}^{d\times d}}\frac{1}{n}\sum_{i=1}^{n}l\!\left(Wx_i,y_i\right) \\[5pt] =&\mathop{{\textrm{arg min}}}_{W\in\mathbb{R}^{d\times d}}\sum_{i=1}^{n}\left\|Wx_i-y_i\right\|_2^2 \\[5pt] =&\mathop{{\textrm{arg min}}}_{W\in\mathbb{R}^{d\times d}}\sum_{i=1}^{n}\left\|Wx_i\right\|_2^2-2y_i^{T}Wx_i+\left\|y_i\right\|_2^2 \\[5pt] =&\mathop{{\textrm{arg max}}}_{W\in\mathbb{R}^{d\times d}}\sum_{i=1}^{n}y_i^{T}Wx_i.\end{aligned}\end{equation}

The last step holds because W is orthogonal, so $\left\|Wx_i\right\|_2=\left\|x_i\right\|_2$ does not depend on W, and minimizing the loss reduces to maximizing $\sum_{i=1}^{n}y_i^{T}Wx_i$ . Let $X=\left(x_1,...,x_n\right)$ and $Y=\left(y_1,...,y_n\right)$ . Equation (9) can then be expressed as

(10) \begin{equation}\mathop{{\textrm{arg max}}}_{W\in\mathbb{R}^{d\times d}}\textrm{Tr}\!\left(Y^{T}WX\right)\end{equation}

where $\textrm{Tr}\!\left(\cdot\right)$ is the trace operator $\textrm{Tr}\!\left(A\right)=\sum_{i=1}^{n}a_{ii}$ . We first rearrange the matrices in the trace operator

(11) \begin{equation}\textrm{Tr}\!\left(Y^{T}WX\right)=\textrm{Tr}\!\left(WXY^{T}\right);\end{equation}

then, using the singular value decomposition $XY^{T}=U\Sigma V^{T}$ , formula 11 becomes

(12) \begin{equation}\textrm{Tr}\!\left(WU\Sigma V^{T}\right)=\textrm{Tr}\!\left(V^{T}WU\Sigma\right).\end{equation}

Since $V^{T}$ , W, and U are orthogonal matrices, $P=V^{T}WU$ must be an orthogonal matrix. From Equation (12), it follows that

(13) \begin{equation}\textrm{Tr}\!\left(V^{T}WU\Sigma\right)=\textrm{Tr}\!\left(P\Sigma\right)=\sum_{i=1}^{n}p_{ii}\sigma_i\leq\sum_{i=1}^{n}\left|p_{ii}\right|\sigma_i\leq\sum_{i=1}^n\sigma_i.\end{equation}

Since P is orthogonal, $\left|p_{ii}\right|\leq 1$ , and the bound in Equation (13) is attained when $P=I$ . As a result, $V^{T}WU=I$ and $W=VU^{T}$ .
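The closed-form solution can be computed directly from the SVD of $XY^{T}$; below is a minimal NumPy sketch, with X and Y stored column-wise as in the text (function name is ours).

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal Procrustes: find W with W @ X ~= Y and W.T @ W = I.

    X : (d, n) matrix whose columns are the pre-retrofitting vectors x_i
    Y : (d, n) matrix whose columns are the retrofitted vectors y_i
    """
    U, _, Vt = np.linalg.svd(X @ Y.T)   # X Y^T = U Sigma V^T
    return Vt.T @ U.T                    # W = V U^T
```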

4.1. Inference from Procrustes method

After obtaining the transformation matrix W, the next step is to infer the sense representations that cannot be retrofitted from the GenSense model (out-of-ontology senses). For a bilingual word embedding, there are two methods for representing these mappings. The first is finding a corresponding mapping of the word that is to be translated:

(14) \begin{equation}t\!\left(i\right)\in\mathop{{\textrm{arg min}}}_{j\in\left\{1,..,n,n+1,...,N\right\}}\left\|Wx_i-y_j\right\|_2^2\end{equation}

for word i, where $t\!\left(i\right)$ denotes the translation and $\left\{1,...,n,n+1,...,N\right\}$ denotes the vocabulary of the target language. Though this process is commonly used in bilingual translation, there is a simpler approach for retrofitting. The second method is simply to apply the transformation matrix on the word that is to be translated. However, in bilingual embedding this method yields only the mapped vector and not the translated word in the target language. One advantage with GenSense is that all corresponding senses (those before and after applying GenSense) are known. As a result, we simply apply the transformation matrix to the senses that are not retrofitted (i.e., $Wx_i$ for sense i). In the experiments, we show only the results of the second method, as it is more natural within the context of GenSense.
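As a toy illustration of this inference step (with random arrays standing in for the real embeddings), the transformation is fit on the retrofitted pairs and then applied directly to an out-of-ontology sense:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 1000                       # toy sizes; real runs use the actual pairs
X = rng.normal(size=(d, n))           # stand-in for pre-retrofitting sense vectors
Y = rng.normal(size=(d, n))           # stand-in for retrofitted sense vectors

U, _, Vt = np.linalg.svd(X @ Y.T)
W = Vt.T @ U.T                        # transformation matrix from Section 4

x_out = rng.normal(size=d)            # a sense the ontology does not cover
y_out = W @ x_out                     # its inferred retrofitted representation
```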

5. Experiments

We evaluated GenSense using four types of experiments: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. In the testing phase, if a test dataset had missing words, we used the average of all sense vectors to represent the missing word. Note that the results we report for vanilla sense embeddings may differ slightly from other work due to the handling of missing words and the similarity computation method. Some work uses a zero vector to represent missing words, whereas other work removes missing words from the dataset. Thus, the reported performance should be compared under the same missing-word processing method and the same similarity computation method.

5.1. Ontology construction

Roget’s 21st Century Thesaurus (Kipfer Reference Kipfer1993) was used to build the ontology in the experiments as it includes strength information for the senses. As Roget does not directly provide an ontology, we used it to manually construct synonym and antonym ontologies. We first fetched all the words with their sense sets. Let Roget’s word vocabulary be $V=\left\{w_1,...,w_n\right\}$ and the initial sense vocabulary be $\big\{w_{11},w_{12}, ..., w_{1m_1}, ...,w_{n1},w_{n2},...,w_{nm_n}\big\}$ , where word $w_i$ has $m_i$ senses, including all parts of speeches (POSes) of $w_i$ . For example, love has four senses: (1) noun, adoration; very strong liking; (2) noun, person who is loved by another; (3) verb, adore, like very much; and (4) verb, have sexual relations. The initial sense ontology would be $O=\big\{W_{11},W_{12},...,W_{1m_1},...,W_{n1},W_{n2},...,W_{nm_n}\big\}$ . In the ontology, each word $w_i$ had $m_i$ senses, of which $w_{i1}$ was the default sense. The default sense is the first sense in Roget and is usually the most common sense when people use this word. Each $W_{ij}$ carried an initial word set $\left\{w_k|w_k\in V, w_k\text{ is the synonym of }w_{ij}\right\}$ . For example, bank has two senses: $bank_1$ and $bank_2$ . The initial word set of $bank_1$ is store and treasury (which refers to financial institution). The initial word set of $bank_2$ is beach and coast (which refers to ground bounding waters). Then we attempted to assign a corresponding sense to the words in the initial word set. For each word $w_k$ in the word set of $W_{ij}$ , we first computed all the intersections of $W_{ij}$ with $W_{k1},W_{k2},...,W_{km_k}$ , after which we selected the largest intersection according to the cardinalities. If all the cardinalities were zero, then we assigned the default sense to the target word. The procedure for the construction of the ontology for Roget’s thesaurus is given in Algorithm 5. The building of the antonym ontology from Roget’s thesaurus was similar to Algorithm 5; it differed in that the initial word set was set to $\left\{w_k|w_k\in V,w_k\text{ is the antonym of } w_{ij}\right\}$ .
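The core sense-assignment step of Algorithm 5 can be sketched as follows, assuming Roget has been loaded as a dict mapping each word to its list of synonym sets (this data structure is our assumption, not the original implementation).

```python
def assign_sense(word_k, target_set, roget_senses):
    """Pick the sense of word_k whose synonym set overlaps most with
    target_set (the word set of some W_ij); fall back to the default
    (first) sense when every intersection is empty.

    roget_senses : dict word -> list of synonym sets (Python sets of words),
                   where index 0 is the default sense.
    """
    best_idx, best_overlap = 0, 0
    for idx, sense_set in enumerate(roget_senses[word_k]):
        overlap = len(sense_set & target_set)
        if overlap > best_overlap:
            best_idx, best_overlap = idx, overlap
    return best_idx   # index 0 (default sense) if all intersections are empty
```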

Algorithm 5. Construction of ontology given Roget’s senses

The vocabulary from the pre-trained GloVe word embedding was used to fetch and build the ontology from Roget. In Roget, the synonym relations were provided in three relevance levels; we set the $\beta$ ’s to $1.0$ , $0.6$ , and $0.3$ for the most relevant synonyms to the least. The antonym relations were constructed in the same way. For each sense, $\beta_{ii}$ was set to the sum of all the relation’s specific weights. Unless specified otherwise, in the experiments we set the $\alpha$ ’s to 1. Although in this study we show only the Roget results, other ontologies (e.g., PPDB Ganitkevitch et al. Reference Ganitkevitch, Van Durme and Callison-Burch2013; Pavlick et al. Reference Pavlick, Rastogi, Ganitkevitch, Van Durme and Callison-Burch2015 and WordNet Miller Reference Miller1998) could be incorporated into GenSense as well.

5.2. Semantic relatedness

Measuring semantic relatedness is a common way to evaluate the quality of the proposed embedding models. We downloaded four semantic relatedness benchmark datasets from the web.

5.2.1. MEN

MEN (Bruni, Tran, and Baroni Reference Bruni, Tran and Baroni2014) contains 3000 word pairs crowdsourced from Amazon Mechanical Turk. Each word pair has a similarity score ranging from 0 to 50. In the crowdsourcing procedure, the annotators were asked to select the more related of two candidate word pairs. For example, between the candidates $\left(wheels, car\right)$ and $\left(dog, race\right)$ , the annotators were to select $\left(wheels, car\right)$ since every car has wheels, but not every dog is involved in a race. We further labeled the POS in MEN: 81% were nouns, 7% were verbs, and 12% were adjectives. The MEN dataset provides two versions of the word pairs: lemma and natural form. We report results on the natural form, but the performance on the two versions is quite similar.

5.2.2. MTurk

MTurk (Radinsky et al. Reference Radinsky, Agichtein, Gabrilovich and Markovitch2011) contains 287 human-labeled examples of word semantic relatedness. Each word pair has a similarity score ranging from 1 to 5 from 10 subjects. A higher score value indicates higher similarity. In MTurk, we labeled the POS: 61% were nouns, 29% were verbs, and 10% were adjectives.

5.2.3. WordSim353 (WS353)

WordSim-353 (Finkelstein et al. Reference Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman and Ruppin2002) contains 353 noun word pairs. Each pair has a human-rated similarity score ranging from 0 to 10. A higher score value indicates higher semantic similarity. For example, the score of $\left(journey, voyage\right)$ is $9.29$ and the score of $\left(king, cabbage\right)$ is $0.23$ .

5.2.4. Rare words

The Rare Words (RW) dataset (Luong, Socher, and Manning Reference Luong, Socher and Manning2013) contains 2034 word pairs crowdsourced from Amazon Mechanical Turk. Each word pair has a similarity score ranging from 0 to 10. A higher score value indicates higher similarity. In RW, the frequencies of some words are very low. Table 2 shows the word frequency statistics of WS353 and RW based on Wikipedia.

Table 2. Word frequency distributions for WS353 and RW

In RW, the number of unknown words is 801, and another 41 words appear no more than 100 times in Wikipedia. In WS353, in contrast, all words appear more than 100 times in Wikipedia. As some of the words are challenging even for native English speakers, the annotators were asked whether they knew the first and second words. Word pairs unknown to most raters were discarded. We labeled the POS: 47% were nouns, 32% were verbs, 19% were adjectives, and 2% were adverbs.

To measure the semantic relatedness between a word pair $\left(w,w^{\prime}\right)$ in the datasets, we adopted the sense evaluation metrics AvgSim and MaxSim (Reisinger and Mooney Reference Reisinger and Mooney2010)

(15) \begin{equation}AvgSim\!\left(w,w^{\prime}\right)=\frac{1}{K_wK_{w^{\prime}}}\sum_{i=1}^{K_w}\sum_{j=1}^{K_{w^{\prime}}}\cos\!\left(s_i\!\left(w\right),s_j\!\left(w^{\prime}\right)\right)\end{equation}
(16) \begin{equation}MaxSim\!\left(w,w^{\prime}\right)=\max_{1\leq i\leq K_w,1\leq j\leq K_{w^{\prime}}}\cos\!\left(s_i\!\left(w\right),s_j\!\left(w^{\prime}\right)\right)\end{equation}

where $K_w$ and $K_{w^{\prime}}$ denote the numbers of senses of w and w′, respectively, and $s_i\!\left(w\right)$ denotes the ith sense vector of w. AvgSim can be seen as a soft metric, as it averages the similarity scores over all sense pairs, whereas MaxSim can be seen as a hard metric, as it selects only the sense pair with the maximum similarity score. To measure the performance of the sense embeddings, we computed the Spearman correlation between the human-rated scores and the AvgSim/MaxSim scores. Table 3 shows a summary of the benchmark datasets and their relationships with the ontology. Row 3 shows the number of words listed both in the datasets and in the ontology. As some words in Roget were not retrofitted, rows 4 and 5 show the number and ratio of words affected by the retrofitting model. The word count for Roget was 63,942.
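A minimal sketch of AvgSim, MaxSim, and the Spearman evaluation, assuming the sense inventory is a dict from each word to a list of its sense vectors (our data layout, not the original evaluation script).

```python
import numpy as np
from itertools import product
from scipy.stats import spearmanr

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses_w, senses_v):
    """Average cosine over all sense pairs of the two words (Equation (15))."""
    return float(np.mean([cos(s, t) for s, t in product(senses_w, senses_v)]))

def max_sim(senses_w, senses_v):
    """Maximum cosine over all sense pairs of the two words (Equation (16))."""
    return max(cos(s, t) for s, t in product(senses_w, senses_v))

def evaluate(pairs, gold_scores, sense_inventory, metric=max_sim):
    """Spearman correlation between human scores and metric scores."""
    pred = [metric(sense_inventory[w], sense_inventory[v]) for w, v in pairs]
    return spearmanr(gold_scores, pred).correlation
```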

Table 3. Semantic relatedness benchmark datasets

5.3. Contextual word similarity

Although the semantic relatedness datasets have been used often, one disadvantage is that the words in these word pairs lack contexts. Therefore, we also conducted experiments with Stanford’s contextual word similarities (SCWS) dataset (Huang et al. Reference Huang, Socher, Manning and Ng2012), which consists of 2003 word pairs (1713 words in total, as some words appear in multiple questions) together with human-rated scores. A higher score value indicates higher semantic relatedness. In contrast to the semantic relatedness datasets, SCWS words have contexts and POS tags; that is, the human subjects knew the usage of the word when they rated the similarity. For each word pair, we computed its AvgSimC/MaxSimC scores from the learned sense embedding (Reisinger and Mooney Reference Reisinger and Mooney2010)

(17) \begin{equation}AvgSimC\!\left(w,w^{\prime}\right)=\frac{1}{K_wK_{w^{\prime}}}\sum_{i=1}^{K_w}\sum_{j=1}^{K_{w^{\prime}}}d\!\left(\pi_i\!\left(w\right),c\right)d\!\left(\pi_j\!\left(w^{\prime}\right),c^{\prime}\right)\cos\!\left(s_i\!\left(w\right),s_j\!\left(w^{\prime}\right)\right)\end{equation}
(18) \begin{equation}MaxSimC\!\left(w,w^{\prime}\right)=\cos\!\left(s_{\hat{\pi}\left(w,c\right)}\!\left(w\right),s_{\hat{\pi}\left(w^{\prime},c^{\prime}\right)}\!\left(w^{\prime}\right)\right)\end{equation}

where $d\!\left(\pi_k\!\left(w\right),c\right)$ is the likelihood of context c belonging to cluster $\pi_k$ , and $\hat{\pi}\!\left(w,c\right)=\mathop{{\textrm{arg max}}}_{1\leq k\leq K_w}d\!\left(\pi_k\!\left(w\right),c\right)$ is the maximum likelihood cluster for w in context c. We used a window size of 5 for words in the word pairs (i.e., 5 words prior to $w/w^{\prime}$ and 5 words after $w/w^{\prime}$ ). Stop words were removed from the context. To measure the performance, we computed the Spearman correlation between the human-rated scores and the AvgSimC/MaxSimC scores.
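A possible instantiation of AvgSimC and MaxSimC is sketched below; note that the cluster likelihood d is approximated here by a softmax over cosine similarities between each sense vector and the averaged context vector, which is our assumption rather than a detail given in the text.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_vector(context_words, word_vectors):
    """Average of the stop-word-filtered 5-word window on each side."""
    return np.mean([word_vectors[w] for w in context_words if w in word_vectors],
                   axis=0)

def sense_likelihoods(senses_w, ctx_vec):
    """d(pi_k(w), c), approximated by a softmax over cosine similarities
    between each sense vector and the context vector (our instantiation)."""
    sims = np.array([cos(s, ctx_vec) for s in senses_w])
    e = np.exp(sims - sims.max())
    return e / e.sum()

def max_sim_c(senses_w, ctx_w, senses_v, ctx_v):
    """Cosine of the most likely sense of each word in its context (Eq. (18))."""
    i = int(np.argmax(sense_likelihoods(senses_w, ctx_w)))
    j = int(np.argmax(sense_likelihoods(senses_v, ctx_v)))
    return cos(senses_w[i], senses_v[j])

def avg_sim_c(senses_w, ctx_w, senses_v, ctx_v):
    """Likelihood-weighted average over all sense pairs (Eq. (17))."""
    dw = sense_likelihoods(senses_w, ctx_w)
    dv = sense_likelihoods(senses_v, ctx_v)
    return sum(dw[i] * dv[j] * cos(si, sj)
               for i, si in enumerate(senses_w)
               for j, sj in enumerate(senses_v))
```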

5.4. Semantic difference

This task was to determine whether a given word is semantically closer to a feature of a concept than another word (Krebs and Paperno Reference Krebs and Paperno2016). In this dataset, there were 528 concepts, 24,963 word pairs, and 128,515 items. Each word pair came with a feature. For example, in the test $\left(airplane,helicopter\right):\,wings$ , the first word was to be chosen if and only if $\cos\left(airplane,wings\right)>\cos\left(helicopter,wings\right)$ ; otherwise, the second word was chosen. As this dataset does not provide context for disambiguation, we used strategies similar to those of the semantic relatedness task:

(19) \begin{equation}AvgSimD\!\left(w,w^{\prime}\right)=\frac{1}{K_wK_{w^{\prime}}}\sum_{i=1}^{K_w}\sum_{j=1}^{K_{w^{\prime}}}\cos\!\left(s_i\!\left(w\right),s_j\!\left(w^{\prime}\right)\right)\end{equation}
(20) \begin{equation}MaxSimD\!\left(w,w^{\prime}\right)=\max_{1\leq i\leq K_w,1\leq j\leq K_{w^{\prime}}}\cos\!\left(s_i\!\left(w\right),s_j\!\left(w^{\prime}\right)\right)\end{equation}

In AvgSimD, we chose the first word iff $AvgSimD\!\left(w_1,w^{\prime}\right)> AvgSimD\!\left(w_2,w^{\prime}\right)$ . In MaxSimD, we chose the first word iff $MaxSimD\!\left(w_1,w^{\prime}\right)> MaxSimD\!\left(w_2,w^{\prime}\right)$ . The performance was determined by computing the accuracy.
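A short sketch of the MaxSimD decision rule for a single test item, reusing the same sense-inventory layout as in the semantic relatedness sketch above.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_sim_d(senses_w, senses_f):
    """MaxSimD: maximum cosine between any sense of w and any sense of the feature."""
    return max(cos(s, t) for s in senses_w for t in senses_f)

def choose_first(senses_w1, senses_w2, senses_feature):
    """True if w1 is judged closer to the feature than w2 under MaxSimD."""
    return max_sim_d(senses_w1, senses_feature) > max_sim_d(senses_w2, senses_feature)
```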

5.5. Synonym selection

Finally, we evaluated the proposed GenSense on three benchmark synonym selection datasets: ESL-50 (acronym for English as a Second Language) (Turney Reference Turney2001), RD-300 (acronym for Reader’s Digest Word Power Game) (Jarmasz and Szpakowicz Reference Jarmasz and Szpakowicz2004), and TOEFL-80 (acronym for Test of English as a Foreign Language) (Landauer and Dumais Reference Landauer and Dumais1997). The number in each dataset name is the number of questions it contains. In each question, there was a question word and a set of answer words. For each sense embedding, the task was to determine which word in the answer set was most similar to the question word. For example, with brass as the question word and metal, wood, stone, and plastic as the answer words, the correct answer was metal.Footnote d As with the semantic relatedness task, we used AvgSim and MaxSim for the synonym selection task. We first used AvgSim/MaxSim to compute the scores between the question word and the words in the answer set and then selected the answer with the maximum score. Performance was determined by computing the accuracy. Table 4 summarizes the synonym selection benchmark datasets.

Table 4. Synonym selection benchmark datasets

5.6. Training models

In the experiments, we use GloVe’s 50d version as the base model unless otherwise specified (Pennington et al. Reference Pennington, Socher and Manning2014). The pre-trained GloVe word embedding was trained on Wikipedia and Gigaword-5 (6B tokens, 400k vocab, uncased, 50d vectors). We also test GloVe’s 300d version and two well-known vector representation models from the literature: Word2Vec’s 300d version (trained on part of the Google News dataset, which contains 100 billion words) (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013a) and FastText’s 300d version (2 million word vectors trained on Common Crawl, which contains 600B tokens) (Bojanowski et al. Reference Bojanowski, Grave, Joulin and Mikolov2017). Since Word2Vec and FastText do not release a 50d version, we extract the first 50 dimensions from the 300d versions to explore the impact of dimensionality. We also conduct experiments on four contextualized word embedding models: BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), DistilBERT (Sanh et al. Reference Sanh, Debut, Chaumond and Wolf2019), RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019), and Transformer-XL (T-XL) (Dai et al. Reference Dai, Yang, Yang, Carbonell, Le and Salakhutdinov2019). To control the experiment, we extract the last four layers of all pre-trained transformer models to represent a word. We tried other layer settings and found experimentally that concatenating the last four layers generates good results. We choose the base uncased version for BERT (3072d) and DistilBERT (3072d), the base version for RoBERTa (3072d), and the transfo-xl-wt103 version for T-XL (4096d).
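For reference, a sketch of how the last four hidden layers can be concatenated with the Hugging Face transformers library; the sub-token pooling (averaging, with special tokens dropped) is our assumption rather than a detail stated in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_vector(word):
    """3072d vector for a single word: concatenate the last four hidden layers
    (4 x 768 for BERT-base) and average over the word's sub-token positions,
    dropping [CLS] and [SEP].  The pooling choice is an assumption."""
    enc = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states            # embedding layer + 12 layers
    last4 = torch.cat(hidden[-4:], dim=-1).squeeze(0)   # (seq_len, 3072)
    return last4[1:-1].mean(dim=0)
```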

We set the convergence criterion for the sense vectors to $\varepsilon=0.1$ and the number of iterations to 10. We used three types of generalization: GenSense-syn (only the synonyms and positive contextual neighbors were considered), GenSense-ant (only the antonyms and negative contextual neighbors were considered), and GenSense-all (everything was considered). Specifically, the objective function of the GenSense-syn was

(21) \begin{equation}\sum_{i=1}^{m}\left[\alpha_1\beta_{ii}\left\|s_i-\hat{q}_{t_i}\right\|^2+\alpha_2\sum_{\left(i,j\right)\in E_{r_1}}\beta_{ij}\left\|s_i-s_j\right\|^2+\alpha_3\sum_{\left(i,j\right)\in E_{r_2}}\beta_{ij}\left\|s_i-\hat{q}_j\right\|^2\right]\end{equation}

and the objective function of the GenSense-ant was

(22) \begin{equation}\sum_{i=1}^m\left[\alpha_1\beta_{ii}\left\|s_i-\hat{q}_{t_i}\right\|^2+\alpha_4\sum_{\left(i,j\right)\in E_{r_3}}\beta_{ij}\left\|s_i+s_j\right\|^2+\alpha_5\sum_{\left(i,j\right)\in E_{r_4}}\beta_{ij}\left\|s_i+\hat{q}_j\right\|^2\right].\end{equation}

6. Results and discussion

6.1. Semantic relatedness

Table 5 shows the Spearman correlation ( $\rho\times100$ ) of AvgSim and MaxSim between the human scores and the sense embedding scores on each benchmark dataset. For each version (except the transformer ones), the first row shows the performance of the vanilla word embedding. Note that the MaxSim and AvgSim scores are equal when there is only one sense for each word (word embedding). The second row shows the performance of the Retro model (Jauhar et al. Reference Jauhar, Dyer and Hovy2015). The third, fourth, and fifth rows show the GenSense performance for three versions: synonym, antonym, and all, respectively. The last row shows the Procrustes method using GenSense-all. The macro-averaged (average over the four benchmark datasets) and the micro-averaged (weighted average, considering the number of word pairs in every benchmark dataset) results are in the rightmost two columns.

Table 5. $\rho\times100$ of (MaxSim/AvgSim) on semantic relatedness benchmark datasets

From Table 5, we observe that the proposed model outperforms Retro and GloVe on all datasets (macro and micro). The best overall model is Procrustes-all. All versions of the GenSense model outperform Retro on almost all the tasks. Retro performs poorly on the RW dataset: in RW, GenSense-syn’s MaxSim score exceeds Retro’s by $22.6$ (GloVe 300d), $29.3$ (FastText 300d), and $29.1$ (Word2Vec 300d). We also observe a significant growth in the Spearman correlation from GenSense-syn to GenSense-all. Surprisingly, the model with only synonyms and positive contextual information already outperforms Retro and GloVe. After utilizing the antonym knowledge from Roget, performance improves further on all but the RW dataset. This suggests that the antonyms in Roget are quite informative and useful. Moreover, GenSense adapts information from synonyms and antonyms to boost its performance. Although the proposed model pulls sense vectors away from their reverse senses with the help of antonyms and negative contextual information, with only negative relations this shift does not guarantee that the new sense vectors always move to a better position. As a result, GenSense-ant generally does not perform as well as GenSense-syn. Procrustes-all performs better than GenSense-all on most tasks, but the improvement is marginal. This is due to the high ratio of retrofitted words (see Table 3); in other words, the Procrustes method is applied only to a small portion of the sense vectors. In both aggregate metrics (macro and micro), the GenSense model outperforms Retro by a large margin, and Procrustes-all is the best among the proposed models. These two metrics attest to the robustness of our proposed model in comparison to the Retro model. For classic word embedding models, FastText outperforms Word2Vec, and Word2Vec outperforms GloVe, though not on all tasks. There is a clear gap between the 50d and 300d versions in all the models. A similar trend can be found in other NLP tasks (Yin and Shen Reference Yin and Shen2018).

The performance of the transformer models and the GenSense models may not be directly comparable, as their training corpora and dimensionalities are very different. From the results, only DistilBERT 3072d outperforms the vanilla GloVe 50d model on RW and WS353, and there is a performance gap compared to GloVe 300d, FastText 300d, and Word2Vec 300d on all the datasets. RoBERTa is the worst among the transformer models. The poor performance of the contextualized word representation models may be due to the fact that they cannot accurately capture the semantic equivalence of contexts (Shi et al. Reference Shi, Chen, Zhou and Chang2019). Another possible reason is that the best configuration of the transformer models was not explored; for example, how many layers should be selected, and how should the selected layers be combined (concatenation, average, max pooling, or min pooling)? Although the performance of the transformers is poor here, we will show that transformer models outperform GenSense under the same setting in the later experiment that involves context (Section 6.2).

We also conducted an experiment to evaluate the benefits yielded by the relation strength. We ran GenSense-syn over the Roget ontology with a grid of $\left(\alpha_1,\alpha_2,\alpha_3\right)$ parameters. Each parameter was tuned from $0.0$ to $1.0$ with a step size of $0.1$ . The default setting of GenSense was set to $\left(1.0,1.0,1.0\right)$ . Table 6 shows the MaxSim results and Table 7 shows the AvgSim results. Note that there may be more than one $\alpha_1/\alpha_2/\alpha_3$ parameter combination for the worst or the best case; in that case, we report only one setting in Tables 6 and 7. From Table 6, we observe that the default setting yields relatively good results in comparison to the best case. Another point worth mentioning is that the worst performance occurs under the $0.1/1/0.1$ setting, except for the WS353 dataset. Similar results can be found for Table 7’s AvgSim metric. Since $\alpha_1$ controls the importance of the distance between the original vector and the retrofitted vector, the fact that a small $\alpha_1$ leads to poor performance suggests that the retrofitted vector should not deviate too far from the originally trained word vector. When observing the best cases, we found that $\alpha_1$ , $\alpha_2$ , and $\alpha_3$ are close to each other in many tasks. For a deeper analysis of how the parameters affect performance, Figure 3 shows the histogram of performance when tuning $\alpha_1/\alpha_2/\alpha_3$ on all the datasets. From Figure 3, we find that the distribution is left-skewed in all tasks except the AvgSim of RW, suggesting the robustness of GenSense-syn.

Table 6. $\rho\times100$ of MaxSim on semantic relatedness benchmark datasets

Table 7. $\rho\times100$ of AvgSim on semantic relatedness benchmark datasets

Figure 3. Histogram of all the combinations and the corresponding performance (MaxSim and AvgSim).

We also ran GenSense-syn over the Roget ontology with another parameter grid covering a wider range. Specifically, the grid parameters were tuned over the set $\left\{0.1,0.5,1.0,1.5,2.0,2.5,3.0,5.0,10.0\right\}$. The results are shown in Tables 8 and 9. From the results, we find that the improvements for the best cases are almost the same as those in Tables 6 and 7 for MEN, MTurk, and WS353; only RW's improvement is larger. In contrast, the worst case drops considerably for all datasets. For example, MEN's MaxSim drops from $52.4$ to $37.0$, a drop of $15.4$. This shows the importance of carefully selecting parameters in the learning model. Also worth mentioning is that these worst cases occur when $\alpha_2$ is large, showing the negative effect of placing too much weight on the synonym neighbors.

In addition to the parameters, it is also worth analyzing the impact of dimensionality. Figure 4 shows the $\rho\times100$ of MaxSim on the semantic relatedness benchmark datasets as a function of the vector dimension. All GloVe pre-trained models were trained on the 6-billion-token corpus with dimensions of 50, 100, 200, and 300. We applied the GenSense-all model to these GloVe pre-trained models. In Figure 4, the proposed GenSense-all outperforms GloVe on all datasets for all tested dimensions. The original GloVe paper showed that GloVe's performance (in terms of accuracy) grows with the dimension between 50d and 300d. In this experiment, we show that the performance of both GloVe and GenSense-all grows with the dimension between 50d and 300d in terms of the $\rho\times100$ of MaxSim. Similar results are found for the AvgSim metric.

Table 8. $\rho\times100$ of MaxSim on semantic relatedness benchmark datasets

Table 9. $\rho\times100$ of AvgSim on semantic relatedness benchmark datasets

Figure 4. $\rho\times100$ of MaxSim on semantic relatedness benchmark datasets as a function of vector dimension. Compared with the GloVe model.

Table 10 shows selected MEN word pairs and their corresponding GenSense-all, GloVe, and Retro scores for a case study. For GenSense-all, GloVe, and Retro, we sorted the MaxSim scores and re-scaled them to MEN's score distribution. From Table 10, we find that GenSense-all improves on the pre-trained word embedding model (in terms of closeness to MEN's score; smaller is better) in the following situations: (1) both words have few senses $\left(lizard, reptiles\right)$, (2) both words have many senses $\left(stripes, train\right)$, and (3) one word has many senses and the other has few senses $\left(rail, railway\right)$. In other words, GenSense-all handles all of these situations well. In some cases, the Retro model also moves the scores closer to MEN's.
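
The re-scaling step is not spelled out in detail above, so the sketch below shows one plausible rank-based reading: each model's predicted similarities are sorted and mapped onto the MEN scores of the same rank before the score differences in Table 10 are computed. Treat this as an assumption about the procedure rather than the exact code we used.

```python
import numpy as np

def rescale_to_gold_distribution(predicted, gold):
    """Map each predicted similarity onto the gold (MEN) score distribution by rank."""
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    ranks = predicted.argsort().argsort()   # rank of each predicted score
    sorted_gold = np.sort(gold)
    return sorted_gold[ranks]               # i-th ranked prediction gets the i-th ranked gold score

# score difference as in Table 10 (smaller is better):
# diff = np.abs(rescale_to_gold_distribution(pred_scores, men_scores) - men_scores)
```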

Table 10. Selected MEN’s word pairs and their score differences from GenSense-all, GloVe, and Retro models (smaller is better)

6.1.1. Standardization on dimensions

Table 11 shows the results of the vanilla word embedding (GloVe), the standardized vanilla word embedding (GloVe-Z), GenSense-all, and the four standardized GenSense methods (rows 4 to 7). The results show that standardization of the vanilla word embedding (GloVe-Z) improves performance on some datasets but not all. In contrast, standardization of GenSense outperforms both GenSense-all and the GloVe-Z model. This suggests that the vanilla word embedding may not be well optimized. Although no single standardization model consistently performs best, Z-GenSense performs best overall in terms of the Macro and Micro metrics.
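
As a concrete illustration, the sketch below standardizes each dimension of an embedding matrix to zero mean and unit variance across the vocabulary. This is the core operation shared by GloVe-Z and the standardized GenSense variants; where the step is placed relative to retrofitting, which distinguishes the four settings, is omitted here.

```python
import numpy as np

def standardize_dimensions(embeddings, eps=1e-8):
    """Z-score each dimension across the vocabulary.

    embeddings: array of shape (vocab_size, dim). Each column is rescaled to zero
    mean and unit variance so that no single dimension dominates cosine similarity.
    """
    mean = embeddings.mean(axis=0, keepdims=True)
    std = embeddings.std(axis=0, keepdims=True)
    return (embeddings - mean) / (std + eps)
```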

Table 11. $\rho\times100$ of (MaxSim/AvgSim) of standardization on dimensions on semantic relatedness benchmark datasets

6.1.2. Neighbor expansion from nearest neighbors

Table 12 compares the vanilla word embedding (GloVe), the Retro model, and the neighbor expansion model. The results support our assumption that nearest neighbors play an important role in the GenSense model: in Table 12, GenSense-NN outperforms GloVe and Retro on all the benchmark datasets.
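
The sketch below illustrates how such nearest neighbors could be collected with a k-d tree over L2-normalized sense vectors, so that Euclidean and cosine neighbors coincide. The neighborhood size k and the way the retrieved neighbors are weighted during retrofitting are assumptions rather than the exact GenSense-NN settings.

```python
import numpy as np
from scipy.spatial import cKDTree

def expand_neighbors(sense_vectors, k=10):
    """For each sense, return its k nearest neighbors in the embedding space.

    sense_vectors: dict of {sense_id: vector}. Vectors are L2-normalized so that
    Euclidean nearest neighbors coincide with cosine nearest neighbors.
    """
    ids = list(sense_vectors)
    matrix = np.vstack([sense_vectors[s] for s in ids])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    tree = cKDTree(matrix)
    _, idx = tree.query(matrix, k=k + 1)  # +1 because the first hit is the sense itself
    return {ids[i]: [ids[j] for j in idx[i][1:]] for i in range(len(ids))}
```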

Table 12. $\rho\times100$ of (MaxSim/AvgSim) of neighbor expansion from nearest neighbors on semantic relatedness benchmark datasets

6.1.3. Combination of standardization and neighbor expansion

Table 13 shows the experimental results of the vanilla word embedding (GloVe), the Retro model, and the four combination models (GenSense-NN-Z, GenSense-NN-Z-last, GenSense-Z-NN, and GenSense-Z-NN-first). The overall best model is GenSense-NN-Z-last (in terms of Macro and Micro), while GenSense-Z-NN performs best on Macro's AvgSim. Again, there is no dominant model that outperforms all others among the combination models, but almost all of them outperform the baseline models GloVe and Retro. Tables 12 and 13 suggest that performance on all the benchmark datasets can be improved further.

Table 13. $\rho\times100$ of (MaxSim/AvgSim) of combination of standardization and neighbor expansion on semantic relatedness benchmark datasets

6.2. Contextual word similarity

Table 14 shows the Spearman correlation $\rho\times100$ on the SCWS dataset. Unlike the other models, for DistilBERT we first embed the entire sentence and then extract the embedding of the target word; after extracting the embeddings of the two words in a pair, we compute their cosine similarity. With the sense-level information, both GenSense and Retro outperform the word embedding model GloVe, and the GenSense model slightly outperforms Retro. The results suggest that the negative relation information in GenSense-ant may not be helpful. We suspect that the quality of SCWS may not be well controlled. As there are 10 subjects for each question in SCWS, we further analyzed the distribution of the score ranges and found that many questions have a large range, reflecting the vagueness of the questions. Overall, $35.5$% of the questions had a range of 10 (i.e., some subjects assigned the highest score and some assigned the lowest score), and $50.0$% had a range of 9 or 10. Unlike the results of the semantic relatedness experiment, all the transformer models outperform the GenSense models. This result is not surprising, as the contextualized word embedding models are pre-trained to represent the target word given its context through objectives such as masked language modeling.
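
For clarity, the sketch below mirrors the DistilBERT procedure described above: embed the full sentence, pool the sub-token vectors of the target word, and compare the two contextualized vectors with cosine similarity. The sub-token matching is deliberately naive and the use of the last hidden layer is an assumption; it is an illustration rather than our evaluation script.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def word_in_context(sentence, target):
    """Embed the whole sentence, then pool the sub-token vectors of the target word."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # [seq_len, dim]
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    tokens = enc["input_ids"][0].tolist()
    # naive sub-token matching; a real implementation should handle repeated words
    for i in range(len(tokens) - len(target_ids) + 1):
        if tokens[i:i + len(target_ids)] == target_ids:
            return hidden[i:i + len(target_ids)].mean(dim=0)
    raise ValueError(f"'{target}' not found in the tokenized sentence")

def scws_similarity(sent1, word1, sent2, word2):
    v1, v2 = word_in_context(sent1, word1), word_in_context(sent2, word2)
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()
```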

Table 14. $\rho\times100$ of (MaxSimC/AvgSimC) on SCWS dataset

6.3. Semantic difference

Table 15 shows the results of the semantic difference experiment. We observe that GenSense outperforms Retro and GloVe by a small margin, and that the accuracy of Retro decreases in this experiment. As this task focuses on concepts, the comparison with GloVe indicates that synonym and antonym information alone is not very useful here; this suggests that further information about concepts is required to improve performance. Nevertheless, the antonym relation plays an important role when computing the semantic difference, especially under the AvgSimD metric.
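
The decision rule behind this task can be sketched as follows: an attribute is predicted to be discriminative for the first word of a pair if it is semantically closer to that word than to the second. The assumption here is that MaxSimD and AvgSimD plug the sense-level maximum and average similarities into this rule; the helper names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def max_sim_d(senses_w, senses_attr):
    # sense-level maximum similarity (the AvgSimD analogue would average instead)
    return max(cosine(s, a) for s in senses_w for a in senses_attr)

def is_discriminative(senses_w1, senses_w2, senses_attr, sim=max_sim_d):
    # predict True if the attribute characterizes w1 more strongly than w2
    return sim(senses_w1, senses_attr) > sim(senses_w2, senses_attr)
```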

Table 15. $\left(Accuracy,Precision,Recall\right)\times100$ of (MaxSimD/AvgSimD) on the semantic difference dataset

6.4. Synonym selection

Table 16 shows the results of the synonym selection experiment: GenSense-all outperforms the baseline models on the RD-300 and TOEFL-80 datasets. In ESL-50, the best model is Retro, showing that improvements are still possible for the default GenSense model. Nevertheless, in ESL-50 GenSense-syn and GenSense-all outperform the vanilla GloVe embedding by a large margin. We also note the relatively poor performance of GenSense-ant in comparison to GenSense-syn and GenSense-all; this shows that the antonym information is relatively unimportant in the synonym selection task.

Table 16. $Accuracy\times100$ of (MaxSim/AvgSim) on the synonym selection datasets ESL-50, RD-300, and TOEFL-80

7. Limitations and future directions

This research focuses on generalizing sense retrofitting models and evaluates them on semantic relatedness, contextual word similarity, semantic difference, and synonym selection datasets. In the semantic relatedness experiment, we compare GenSense with BERT-family models and show that GenSense can outperform them. However, the experiment may be unfair to the BERT-family models, as they are context-sensitive language models while each word pair in the dataset is context-free; this mismatch between the nature of the dataset and the models makes the comparison harder to interpret. A possible way to address the issue is to apply the generalized sense representations learnt by the proposed method in downstream natural language processing applications and conduct extrinsic evaluations. If the downstream tasks involve context-sensitive features, the tasks themselves will give an advantage to the context-sensitive models. Nevertheless, it would be interesting to evaluate the embeddings extrinsically by training neural network models of the same architecture for downstream tasks (e.g., named entity recognition (Santos, Consoli, and Vieira 2020)) and comparing different language models. Another research direction related to this work is fair NLP. Since word embeddings are largely affected by corpus statistics, many studies focus on debiasing word embeddings, especially with respect to gender, race, and society (Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017; Brunet et al. 2019). How to incorporate these debiasing techniques into the GenSense model is worth exploring. Finally, research in the social sciences, psychology, and history depends heavily on high-quality word or sense embedding models (Hamilton, Leskovec, and Jurafsky 2016; Caliskan et al. 2017; Jonauskaite et al. 2021). Our proposed GenSense embedding model can bring great value to these research fields.

8. Conclusions

In this paper, we present GenSense, a generalized framework for learning sense embeddings. As GenSense belongs to the family of post-processing retrofitting models, it enjoys all the benefits of retrofitting, such as shorter training times and lower memory requirements. In the generalization, (1) we extend the synonym relation to the positive contextual neighbor relation, the antonym relation, and the negative contextual neighbor relation; (2) we consider the semantic strength of each relation; and (3) we use the relation strength between relations to balance the different components. We conduct experiments on four types of tasks: semantic relatedness, contextual word similarity, semantic difference, and synonym selection. Experimental results show that GenSense outperforms previous approaches. In the grid-search experiment evaluating the benefits yielded by the relation strength, we find an $87.7$% performance difference between the worst and the best cases on the WS353 dataset. Based on the proposed GenSense, we also propose a standardization process on the dimensions with four settings, a neighbor expansion process from the nearest neighbors, and four different combinations of the two approaches. Finally, we propose a Procrustes analysis approach inspired by bilingual mapping models for learning representations outside the ontology. The experimental results show the advantages of these modifications on the semantic relatedness task. We have released the source code and the pre-trained model as a resource for the research community (Footnotes e and f). Other versions of the sense-retrofitted embeddings can be found on the website.

Acknowledgements

This research was partially supported by the Ministry of Science and Technology, Taiwan, under grants MOST-107-2634-F-002-019, MOST-107-2634-F-002-011, and MOST-106-2923-E-002-012-MY3.

Footnotes

These authors contributed equally to this work.

a Besides Word2Vec and GloVe, embeddings trained by others on different corpora and with different parameters may also be used.

b Note that $\hat{q}_{t_j}$ and $\hat{q}_{t_k}$ may be mapped to the same vector representation even if $j\neq k$ . For example, let $t_j$ be gay.n.01 and $t_k$ be gay.a.01. Then both $\hat{q}_{t_j}$ and $\hat{q}_{t_k}$ are mapped to the word form vector of gay.

c Theoretically, the dimensionality of $x_i$ and $y_i$ can be different. However, in this study we only consider the special case in which the dimensionality of $x_i$ and $y_i$ is the same.

d Note that in a few cases the answer set contains a phrase (e.g., decoration style); we remove such questions. In a few other cases, the question or answer words are not in the word embedding's vocabulary; we also remove these questions from the dataset.

References

Artetxe, M., Labaka, G. and Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA. Association for Computational Linguistics, pp. 2289–2294.
Azzini, A., da Costa Pereira, C., Dragoni, M. and Tettamanzi, A.G. (2011). A neuro-evolutionary corpus-based method for word sense disambiguation. IEEE Intelligent Systems 27(6), 26–35.
Bengio, Y., Delalleau, O. and Le Roux, N. (2006). Label propagation and quadratic criterion. In Semi-Supervised Learning.
Bian, J., Gao, B. and Liu, T.-Y. (2014). Knowledge-powered deep learning for word embedding. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), Nancy, France. Springer, pp. 132–148.
Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146.
Bollegala, D., Alsuhaibani, M., Maehara, T. and Kawarabayashi, K.-i. (2016). Joint word representation learning using a corpus and a semantic lexicon. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), Phoenix, AZ, USA. AAAI Press, pp. 2690–2696.
Bolukbasi, T., Chang, K.-W., Zou, J.Y., Saligrama, V. and Kalai, A.T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), vol. 29, Barcelona, Spain. Curran Associates, Inc.
Brunet, M.-E., Alkalay-Houlihan, C., Anderson, A. and Zemel, R. (2019). Understanding the origins of bias in word embeddings. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA. PMLR, pp. 803–811.
Bruni, E., Tran, N.-K. and Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, 1–47.
Bullinaria, J.A. and Levy, J.P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods 39(3), 510–526.
Caliskan, A., Bryson, J.J. and Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science 356(6334), 183–186.
Camacho-Collados, J. and Pilehvar, M.T. (2018). From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research 63, 743–788.
Camacho-Collados, J., Pilehvar, M.T. and Navigli, R. (2015). NASARI: A novel approach to a semantically-aware representation of items. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Denver, CO, USA. Association for Computational Linguistics, pp. 567–577.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q. and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy. Association for Computational Linguistics, pp. 2978–2988.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (NAACL-HLT), Minneapolis, MN, USA. Association for Computational Linguistics, pp. 4171–4186.
Dolan, B., Quirk, C. and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland. COLING, pp. 350–356.
Dragoni, M. and Petrucci, G. (2017). A neural word embeddings approach for multi-domain sentiment analysis. IEEE Transactions on Affective Computing 8(4), 457–470.
Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics, pp. 55–65.
Ettinger, A., Resnik, P. and Carpuat, M. (2016). Retrofitting sense-specific word vectors using parallel text. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), San Diego, CA, USA. Association for Computational Linguistics, pp. 1378–1383.
Faruqui, M., Dodge, J., Jauhar, S.K., Dyer, C., Hovy, E. and Smith, N.A. (2015). Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Denver, CO, USA. Association for Computational Linguistics, pp. 1606–1615.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G. and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1), 116–131.
Ganitkevitch, J., Van Durme, B. and Callison-Burch, C. (2013). PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Atlanta, GA, USA. Association for Computational Linguistics, pp. 758–764.
Glavaš, G. and Vulić, I. (2018). Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), Melbourne, Australia. Association for Computational Linguistics, pp. 34–45.
Hamilton, W.L., Leskovec, J. and Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), Berlin, Germany. Association for Computational Linguistics, pp. 1489–1501.
Huang, E.H., Socher, R., Manning, C.D. and Ng, A.Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL), Jeju Island, Korea. Association for Computational Linguistics, pp. 873–882.
Iacobacci, I., Pilehvar, M.T. and Navigli, R. (2015). SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (ACL-IJCNLP), Beijing, China. Association for Computational Linguistics, pp. 95–105.
Jarmasz, M. and Szpakowicz, S. (2004). Roget's thesaurus and semantic similarity. In Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003, 111.
Jauhar, S.K., Dyer, C. and Hovy, E. (2015). Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Denver, CO, USA. Association for Computational Linguistics, pp. 683–693.
Jonauskaite, D., Sutton, A., Cristianini, N. and Mohr, C. (2021). English colour terms carry gender and valence biases: A corpus study using word embeddings. PLoS ONE 16(6), e0251559.
Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H. and Grave, E. (2018). Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium. Association for Computational Linguistics, pp. 2979–2984.
Kipfer, B.A. (1993). Roget's 21st Century Thesaurus in Dictionary Form: The Essential Reference for Home, School, or Office. Laurel.
Krebs, A. and Paperno, D. (2016). Capturing discriminative attributes in a distributional space: Task proposal. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany. Association for Computational Linguistics, pp. 51–54.
Landauer, T.K. and Dumais, S.T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2), 211–240.
Leacock, C. and Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database 49(2), 265–283.
Lee, Y.-Y., Ke, H., Huang, H.-H. and Chen, H.-H. (2016). Combining word embedding and lexical database for semantic relatedness measurement. In Proceedings of the 25th International Conference Companion on World Wide Web (WWW), Montréal, Québec, Canada. International World Wide Web Conferences Steering Committee, pp. 73–74.
Lee, Y.-Y., Yen, T.-Y., Huang, H.-H. and Chen, H.-H. (2017). Structural-fitting word vectors to linguistic ontology for semantic relatedness measurement. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM), Singapore. Association for Computing Machinery, pp. 2151–2154.
Lee, Y.-Y., Yen, T.-Y., Huang, H.-H., Shiue, Y.-T. and Chen, H.-H. (2018). GenSense: A generalized sense retrofitting model. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), Santa Fe, NM, USA. Association for Computational Linguistics, pp. 1662–1671.
Lengerich, B.J., Maas, A.L. and Potts, C. (2017). Retrofitting distributional embeddings to knowledge graphs with functional relations. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), Santa Fe, NM, USA. Association for Computational Linguistics.
Li, J. and Jurafsky, D. (2015). Do multi-sense embeddings improve natural language understanding? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal. Association for Computational Linguistics, pp. 1722–1732.
Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), vol. 98, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc., pp. 296–304.
Lin, D. and Pantel, P. (2001). DIRT – discovery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Francisco, CA, USA. Association for Computing Machinery, pp. 323–328.
Liu, X., Nie, J.-Y. and Sordoni, A. (2016). Constraining word embeddings by prior knowledge – application to medical information retrieval. In Information Retrieval Technology. Beijing, China: Springer International Publishing, pp. 155–167.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Loureiro, D. and Jorge, A. (2019). Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy. Association for Computational Linguistics, pp. 5682–5691.
Luong, T., Socher, R. and Manning, C. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), Sofia, Bulgaria. Association for Computational Linguistics, pp. 104–113.
Mancini, M., Camacho-Collados, J., Iacobacci, I. and Navigli, R. (2017). Embedding words and senses together via joint knowledge-enhanced training. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL), Vancouver, Canada. Association for Computational Linguistics, pp. 100–111.
Maneewongvatana, S. and Mount, D.M. (1999). It's okay to be skinny, if your friends are fat. In Center for Geometric Computing 4th Annual Workshop on Computational Geometry, vol. 2, pp. 1–8.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Le, Q.V. and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Miller, G.A. (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Mrkšić, N., Ó Séaghdha, D., Thomson, B., Gašić, M., Rojas-Barahona, L., Su, P.-H., Vandyke, D., Wen, T.-H. and Young, S. (2016). Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), San Diego, CA, USA. Association for Computational Linguistics, pp. 142–148.
Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B. and Callison-Burch, C. (2015). PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (ACL-IJCNLP), Beijing, China. Association for Computational Linguistics, pp. 425–430.
Pennington, J., Socher, R. and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. Association for Computational Linguistics, pp. 1532–1543.
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (NAACL-HLT), New Orleans, LA, USA. Association for Computational Linguistics.
Qiu, L., Tu, K. and Yu, Y. (2016). Context-dependent sense embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA. Association for Computational Linguistics, pp. 183–191.
Quirk, C., Brockett, C. and Dolan, W.B. (2004). Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain. Association for Computational Linguistics, pp. 142–149.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9.
Radinsky, K., Agichtein, E., Gabrilovich, E. and Markovitch, S. (2011). A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web (WWW). New York, NY, USA: Association for Computing Machinery, pp. 337–346.
Reisinger, J. and Mooney, R.J. (2010). Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Los Angeles, CA, USA. Association for Computational Linguistics, pp. 109–117.
Remus, S. and Biemann, C. (2018). Retrofitting word representations for unsupervised sense aware word similarities. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan. European Language Resources Association.
Sanh, V., Debut, L., Chaumond, J. and Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Santos, J., Consoli, B. and Vieira, R. (2020). Word embedding evaluation in downstream tasks and semantic analogies. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), Marseille, France. European Language Resources Association, pp. 4828–4834.
Scarlini, B., Pasini, T. and Navigli, R. (2020). SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 34(05), 8758–8765.
Shi, W., Chen, M., Zhou, P. and Chang, K.-W. (2019). Retrofitting contextualized word embeddings with paraphrases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.
Smith, S.L., Turban, D.H., Hamblin, S. and Hammerla, N.Y. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations (ICLR), Toulon, France. OpenReview.net.
Sun, F., Guo, J., Lan, Y., Xu, J. and Cheng, X. (2016). Inside out: Two jointly predictive models for word representations and phrase representations. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 30, Phoenix, AZ, USA. AAAI Press.
Turney, P.D. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML). Berlin, Heidelberg: Springer, pp. 491–502.
Wiedemann, G., Remus, S., Chawla, A. and Biemann, C. (2019). Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS), Erlangen, Germany. German Society for Computational Linguistics & Language Technology.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL), Las Cruces, NM, USA. Association for Computational Linguistics, pp. 133–138.
Xing, C., Wang, D., Liu, C. and Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Denver, CO, USA. Association for Computational Linguistics, pp. 1006–1011.
Yen, T.-Y., Lee, Y.-Y., Huang, H.-H. and Chen, H.-H. (2018). That makes sense: Joint sense retrofitting from contextual and ontological information. In Companion Proceedings of the Web Conference 2018 (WWW), Lyon, France. International World Wide Web Conferences Steering Committee, pp. 15–16.
Yin, Z. and Shen, Y. (2018). On the dimensionality of word embedding. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS), vol. 31, Montréal, Canada. Curran Associates, Inc., pp. 895–906.
Yu, M. and Dredze, M. (2014). Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (ACL), Baltimore, MD, USA. Association for Computational Linguistics, pp. 545–550.