1. Introduction
Undirected binary network data measure the presence or absence of a relationship between pairs of actors and have recently become extremely common in the social and biological sciences. Some examples of data that are naturally represented as undirected binary networks are international relations among countries (Fagiolo et al., Reference Fagiolo, Reyes and Schiavo2008), gene co-expression (Zhang & Horvath, Reference Zhang and Horvath2005), and interactions among students (Han et al., Reference Han, McCubbins and Paulsen2016). We focus on an example of politically aligned books, where a relation exists between two books if they were frequently purchased by the same person on Amazon.com. Our motivations are estimation of the effects of exogenous covariates, such as the effect of alignment of political ideologies of pairs of books on the propensity for books to be purchased by the same consumer, and the related problem of predicting unobserved relations using book ideological information. For example, predictions of relations between new books and old books could be used to recommend new books to potential purchasers.
A binary, undirected network $\left \{ y_{ij} \in \{0,1 \}: i,j\in \{1,\ldots,n\}, i \lt j \right \}$, which we abbreviate $\{y_{ij} \}_{ij}$, may be represented as an $n \times n$ symmetric adjacency matrix that describes the presence or absence of relationships between unordered pairs of $n$ actors. The diagonal elements of the matrix, $\{y_{ii}\;:\; i \in \{1,\ldots,n\}\}$, are left undefined, as we do not consider an actor's relation with itself. We use $\textbf{y}$ to refer to the $\binom{n}{2}$-vector of network relations formed by a columnwise vectorization of the upper triangle of the matrix corresponding to $\{y_{ij} \}_{ij}$.
A regression model for the probability of observing a binary outcome is the probit model, which can be expressed
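(The following display is our reconstruction of (1), based on the definitions that follow, and should be read as a sketch rather than the exact original equation.) In latent-variable form, the probit model for relation $ij$ is

$y_{ij} = \textbf{1}\left [ \, {\textbf{x}}_{ij}^T{\boldsymbol{\beta }} + \epsilon _{ij} \gt 0 \, \right ], \qquad (1)$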
where $\epsilon _{ij}$ is a mean-zero normal random error, ${\textbf{x}}_{ij}$ is a fixed vector of covariates corresponding to relation $ij$ , and $\boldsymbol{\beta }$ is a vector of coefficients to be estimated. When each entry in the error network $\{ \epsilon _{ij} \}_{ij}$ is independent of the others, estimation of the probit regression model in (1) is straightforward and proceeds via standard gradient methods for maximum likelihood estimation of generalized linear models (Greene, Reference Greene2003). The assumption of independence of $\{ \epsilon _{ij} \}_{ij}$ may be appropriate when the mean $\big\{{\textbf{x}}_{ij}^T{\boldsymbol{\beta }} \big\}_{ij}$ represents nearly all of the dependence in the network $\{ y_{ij}\}_{ij}$ . However, network data naturally contain excess dependence beyond the mean: the errors $\epsilon _{ij}$ and $\epsilon _{ik}$ both concern actor $i$ (see Faust & Wasserman, Reference Faust and Wasserman1994, e.g., for further discussion of dependencies in network data). In the context of the political books data set, the propensity of “Who’s Looking Out For You?” by Bill O’Reilly to be purchased by the same reader as “Deliver Us from Evil” by Sean Hannity may be similar to the propensity of “Who’s Looking Out For You?” and “My Life” by Bill Clinton to be co-purchased simply because “Who’s Looking Out For You?” is a popular book. Or, in a student friendship network, the friendship that Julie makes with Steven may be related to the friendship that Julie makes with Asa due to Julie’s gregariousness. Unlike the case of typical linear regression, the estimator that maximizes the likelihood of the generalized linear regression model in (1), when assuming independence of each entry in the error network $\{ \epsilon _{ij} \}_{ij}$ , is not unbiased for $\boldsymbol{\beta }$ . Ignoring the excess dependence in $\{\epsilon _{ij} \}_{ij}$ can thus be expected to result in poor estimation of $\boldsymbol{\beta }$ and poor out-of-sample predictive performance. We observe this phenomenon in the simulation studies and analysis of the political books network (see Sections 7 and 8, respectively). Thus, estimators of $\boldsymbol{\beta }$ and ${\mathbb{P}}(y_{ij} = 1)$ in (1) for the network $\{ y_{ij} \}_{ij}$ should ideally account for the excess dependence of network data. A host of regression models exist in the literature that do just this; we briefly review these here.
A method used to account for excess dependence in regression of binary network data is the estimation of generalized linear mixed models, which were first introduced for repeated measures studies (Stiratelli et al., Reference Stiratelli, Laird and Ware1984; Breslow & Clayton, Reference Breslow and Clayton1993). In these models, a random effect, that is, latent variable, is estimated for each individual in the study, to account for possible individual variation. Warner et al. (Reference Warner, Kenny and Stoto1979) used latent variables to account for excess network dependence when analyzing data with continuous measurements of relationships between actors, and Holland & Leinhardt (Reference Holland and Leinhardt1981) extended their approach to networks consisting of binary observations. Hoff et al. (Reference Hoff, Raftery and Handcock2002) further extended this approach to include nonlinear functions of latent variables, and since then, many variations have been proposed (Handcock et al., Reference Handcock, Raftery and Tantrum2007; Hoff, Reference Hoff2008; Sewell & Chen, Reference Sewell and Chen2015). We refer to parametric network models wherein the observations are independent conditional on random latent variables as “latent variable network models,” which we discuss in detail in Section 2. Separate latent variable approaches may lead to vastly different estimates of $\boldsymbol{\beta }$ , and it may not be clear which model’s estimate of $\boldsymbol{\beta }$ , or prediction, to choose. Goodness-of-fit checks are the primary method of assessing latent variable network model fit (Hunter et al., Reference Hunter, Goodreau and Handcock2008b); however, selecting informative statistics is a well-known challenge. Finally, latent variable network models are typically computationally burdensome to estimate, often relying on Markov chain Monte Carlo methods.
Another approach to estimating covariate effects on network outcomes is the estimation of exponential random graph models, known as ERGMs. An ERGM represents the probability of the observed network using an exponential family distribution, ${\mathbb{P}}(\textbf{y}) \propto \textrm{exp}(\textbf{t}(\textbf{y},\textbf{X})^T \boldsymbol{\theta })$, where $\boldsymbol{\theta }$ is a vector of parameters to be estimated. In this flexible formulation, the effects of the exogenous covariates are included through the network statistics $\textbf{t}(\textbf{y},\textbf{X})$. ERGMs also account for excess network dependence using the network statistics $\textbf{t}(\textbf{y},\textbf{X})$, such as counts of the number of observed triangles or the number of “2-stars”—pairs of indicated relations that share an actor. ERGMs were developed by Frank & Strauss (Reference Frank and Strauss1986) and Snijders et al. (Reference Snijders, Pattison, Robins and Handcock2006) and are typically estimated using Markov chain Monte Carlo (MCMC) methods (Snijders, Reference Snijders2002; Handcock et al., Reference Handcock, Hunter, Butts, Goodreau, Krivitsky and Morris2019; Hunter et al., Reference Hunter, Handcock, Butts, Goodreau and Morris2008a). ERGMs have been shown to be prone to placing unrealistic quantities of probability mass on networks consisting of all “1”s or all “0”s (Handcock et al., Reference Handcock, Robins, Snijders, Moody and Besag2003; Schweinberger, Reference Schweinberger2011), and the estimation procedures may be slow to complete (Caimo & Friel, Reference Caimo and Friel2011). Further, parameter estimates typically cannot be generalized to populations outside the observed network (Shalizi & Rinaldo, Reference Shalizi and Rinaldo2013).
A final approach to account for excess network dependence is to explicitly model the correlation among network observations. This is the approach we take in this paper. In this approach, an unobserved normal random variable, $z_{ij}$ , is proposed to underlie each data point, such that $y_{ij} = \textbf{1}[z_{ij} \gt 0]$ for $\textbf{z} \sim \textrm{N}(\textbf{X}{\boldsymbol{\beta }}, \boldsymbol{\Omega }(\boldsymbol{\theta }))$ . In this formulation, excess dependence due to the network is accounted for in $\boldsymbol{\Omega }$ . The parameters $\boldsymbol{\beta }$ and $\boldsymbol{\theta }$ of the distribution of the unobserved normal random variables $\{ z_{ij} \}_{ij}$ may be estimated using likelihood methods. For example, Ashford & Sowden (Reference Ashford and Sowden1970) propose likelihood ratio hypothesis tests and Ochi & Prentice (Reference Ochi and Prentice1984) give closed-form parameter estimators for studies of repeated observations on the same individual, such that $\boldsymbol{\Omega }(\boldsymbol{\theta })$ is block diagonal. In more general scenarios, such as unrestricted correlation structures, methods such as semi-parametrics (Connolly & Liang, Reference Connolly and Liang1988), pseudo-likelihoods (Le Cessie & Van Houwelingen, Reference Le Cessie and Van Houwelingen1994), and MCMC approximations to EM algorithms (Chib & Greenberg, Reference Chib and Greenberg1998; Li & Schafer, Reference Li and Schafer2008) are employed for estimation.
In this paper, we propose the probit exchangeable (PX) model, a parsimonious regression model for undirected binary network data based on an assumption of exchangeability of the unobserved normal random variables $\{ z_{ij} \}_{ij}$. The assumption of exchangeability is pervasive in random network models and, in fact, underlies many of the latent variable network models (see Section 3 for a detailed discussion of exchangeability).Footnote 1 We show that, under exchangeability, the excess network dependence in $\{ z_{ij} \}_{ij}$ may be quantified using a single parameter $\rho$ such that $\boldsymbol{\Omega }(\boldsymbol{\theta }) = \boldsymbol{\Omega }(\rho )$. This holds regardless of the particular exchangeable generating model, and thus, our approach can be seen as subsuming exchangeable latent variable network models, at least up to the second moment of their latent distributions. The proposed model may be rapidly estimated using an expectation-maximization (EM) algorithm to attain a numerical approximation to the maximum likelihood estimator, where we make approximations in the expectation step for runtime considerations. The estimation scheme we employ is similar to those used to estimate generalized linear mixed models in the literature (Littell et al., Reference Littell, Milliken, Stroup, Wolfinger and Oliver2006; Gelman & Hill, Reference Gelman and Hill2006).
This paper is organized as follows. As latent variable network models are strongly related to our work, we review them in detail in Section 2. We provide supporting theory for exchangeable random network models and their connections to latent variable network models in Section 3. We define the PX model in Section 4 and describe its estimation in Section 5. In Section 6, we give a method for making predictions of unobserved relations. In Section 7, we provide simulation studies demonstrating consistency of the proposed estimation algorithm and the improvement of the proposed model over latent variable network models in estimating $\boldsymbol{\beta }$. We analyze a network of political books in Section 8, demonstrating the reduction in runtime achieved by the PX model, and compare its out-of-sample performance to that of existing latent variable network models. A discussion with an eye toward future work is provided in Section 9.
2. Latent variable network models
In this section, we briefly summarize a number of latent variable network models in the literature that are used to capture excess dependence in network observations. All latent variable network models we consider here may be written in the common form
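(The following display is our reconstruction of (2), based on the definitions below and on the decomposition $\epsilon _{ij} = f_{\boldsymbol{\theta }}(\textbf{v}_i, \textbf{v}_j) + \xi _{ij}$ used in Section 3; the Gaussian noise term is an assumption on our part.)

$y_{ij} = \textbf{1}\left [ \, \mu _{ij} + f_{\boldsymbol{\theta }}(\textbf{v}_i, \textbf{v}_j) + \xi _{ij} \gt 0 \, \right ], \qquad \xi _{ij} \sim \textrm{N}(0, \sigma ^2), \qquad (2)$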
where $\textbf{v}_i \in{\mathbb{R}}^K$ has mean $\textbf{0}$ and covariance matrix $\boldsymbol{\Sigma }_v$, and $\mu _{ij}$ is fixed. We avoid specifying a distribution for the latent vectors $\{ \textbf{v}_{i} \}_{i=1}^n$, although they are often taken to be multivariate Gaussian. We set the total variance of the latent variable representation to be $1 = \sigma ^2 + \textrm{var}[f_{\boldsymbol{\theta }} (\textbf{v}_i, \textbf{v}_j)]$, since this total variance is not identifiable. The function of the latent variables $f_{\boldsymbol{\theta }} \;:\;{\mathbb{R}}^K \times{\mathbb{R}}^K \rightarrow{\mathbb{R}}$, parametrized by $\boldsymbol{\theta }$, serves to distinguish the latent variable network models discussed below. Regression latent variable network models are formed when the latent mean is represented as a linear function of exogenous covariates ${\textbf{x}}_{ij} \in{\mathbb{R}}^p$, such that $\mu _{ij} ={\textbf{x}}_{ij}^T{\boldsymbol{\beta }}$. The latent nodal random vectors $\{ \textbf{v}_{i} \}_{i=1}^n$ represent excess network dependence beyond the mean $\mu _{ij}$. Relations $y_{ij}$ and $y_{ik}$ share the latent vector $\textbf{v}_i$ corresponding to the shared actor $i$, and thus $y_{ij}$ and $y_{ik}$ have related distributions through the latent function $f_{\boldsymbol{\theta }}$. Many popular models for network data may be represented as in (2), such as the social relations model, the latent position model, and the latent eigenmodel.
2.1 Social relations model
The social relations model was first developed for continuous, directed network data (Warner et al., Reference Warner, Kenny and Stoto1979; Wong, Reference Wong1982; Snijders & Kenny, Reference Snijders and Kenny1999). In the social relations model for binary network data (Hoff, Reference Hoff2005), $f_{\boldsymbol{\theta }}(\textbf{v}_i, \textbf{v}_j) = \textbf{v}_i + \textbf{v}_j$ and $\textbf{v}_i = a_i \in \mathbb{R}$ for each actor $i$ , such that
Each actor’s latent variable $\{ a_i \}_{i=1}^n$ may be thought of as the actor’s sociability: large values of $a_i$ correspond to actors with a higher propensity to form relations in the network. The random $\{ a_i \}_{i=1}^n$ in (3) also accounts for the excess correlation in network data; any two relations that share an actor, for example, $y_{ij}$ and $y_{ik}$ , are marginally correlated.
2.2 Latent position model
A more complex model for representing excess dependence in social network data is the latent position model (Hoff et al., Reference Hoff, Raftery and Handcock2002). The latent position model extends the idea of the social relations model by giving each actor $i$ a latent position $\textbf{u}_i$ in a Euclidean latent space, for example ${\mathbb{R}}^{K}$ . Then, actors whose latent positions are closer together in Euclidean distance are more likely to share a relation:
In the form of (2), the latent position model contains latent random vector $\textbf{v}_{i} = [a_i, \textbf{u}_i]^T \in{\mathbb{R}}^{K+1}$ , and $f_{\boldsymbol{\theta }}(\textbf{v}_{i}, \textbf{v}_j ) = a_i + a_j - || \textbf{u}_i - \textbf{u}_j ||_2$ . Hoff et al. (Reference Hoff, Raftery and Handcock2002) show that the latent position model is capable of representing transitivity, that is, when $y_{ij} = 1$ and $y_{jk} = 1$ , it is more likely that $y_{ik} = 1$ . Models that are transitive often display a pattern observed in social network data: a friend of my friend is also my friend (Wasserman & Faust, Reference Wasserman and Faust1994).
2.3 Latent eigenmodel
The latent eigenmodel also associates each actor with a latent position $\textbf{u}_i$ in a latent Euclidean space; however, the inner product between latent positions (weighted by symmetric parameter matrix $\Lambda$ ) measures the propensity of actors $i$ and $j$ to form a relation, rather than the distance between positions (Hoff, Reference Hoff2008):
In the context of (2), the function $f_{\boldsymbol{\theta }}(\textbf{v}_{i}, \textbf{v}_{j}) = a_i + a_j + \textbf{u}_i^T \Lambda \textbf{u}_j$ for the latent eigenmodel, where the parameters $\boldsymbol{\theta }$ are the entries in $\Lambda$ and $\textbf{v}_{i} = [a_i, \textbf{u}_i]^T \in{\mathbb{R}}^{K+1}$ . Hoff (Reference Hoff2008) shows that the latent eigenmodel is capable of representing transitivity and that the latent eigenmodel generalizes the latent position model given sufficiently large dimension of the latent vectors $K$ .
In addition to transitivity, a second phenomenon observed in social networks is structural equivalence, wherein different groups of actors in the network form relations in a similar manner to others in their group. One form of structural equivalence is assortative community structure, where the social network may be divided into groups of nodes that share many relations within group but relatively few relations across groups. Such behavior is common when cliques are formed in high school social networks or around subgroups in online social networks. Another form of structural equivalence arises when actors in a given group are more likely to form relations with actors in other groups than with actors in their own group, for example, in networks of high-functioning brain regions when performing cognitively demanding tasks (Betzel et al., Reference Betzel, Bertolero and Bassett2018). Two models that are aimed at identifying subgroups of nodes that are structurally equivalent are the latent class model of Nowicki & Snijders (Reference Nowicki and Snijders2001) and the mixed membership stochastic blockmodel (Airoldi et al., Reference Airoldi, Blei, Fienberg and Xing2008). Hoff (Reference Hoff2008) shows that the latent eigenmodel is capable of representing stochastic equivalence in addition to transitivity and that the latent eigenmodel generalizes latent class models given sufficiently large dimension of the latent vectors $K$. For this reason, we focus on the latent eigenmodel, and the simpler social relations model, as reference models in this paper.
2.4 Drawbacks
The latent variable network models discussed in this section were developed based on the patterns often observed in real-world social networks. Latent variable network models contain different terms to represent the social phenomena underlying these patterns, and thus, different models may lead to substantially different estimates of $\boldsymbol{\beta }$ . It may not be clear which model’s estimate of $\boldsymbol{\beta }$ , or which model’s prediction of $\{ y_{ij} \}_{ij}$ , is best. Generally, latent variable network models are evaluated using goodness-of-fit checks (Hunter et al., Reference Hunter, Goodreau and Handcock2008b), rather than rigorous tests, and it is well-known that selecting informative statistics for the goodness-of-fit checks is challenging. The latent variable network models described in this section are typically estimated using a Bayesian Markov chain Monte Carlo (MCMC) approach, which may be slow, especially for large data sets. Some recent advances do directly attempt to maximize the likelihood of network models with latent spaces (Ma et al., Reference Ma, Ma and Yuan2020; Zhang et al., Reference Zhang, Xu and Zhu2022); however, public software implementations of these methods do not appear available, and they require certain covariate types (relation-level and actor-level, respectively) and certain latent space structures, such as the Euclidean distance latent space.
3. Exchangeable network models
To motivate the formulation of the proposed model, we briefly discuss the theory of exchangeable random network models and their relationship to latent variable network models. A random network model for $\{\epsilon _{ij} \}_{ij}$ is exchangeable if the distribution of $\{\epsilon _{ij} \}_{ij}$ is invariant to permutations of the actor labels, that is, if
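(The following display is our reconstruction of condition (6).)

$\left \{ \epsilon _{ij} \right \}_{ij} \overset{d}{=} \left \{ \epsilon _{\pi (i) \pi (j)} \right \}_{ij}, \qquad (6)$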
for any permutation $\pi (.)$ . There is a rich theory of exchangeable network models, dating back to work on exchangeable random matrices (Hoover, Reference Hoover1979; Aldous, Reference Aldous1981), upon which we draw in this section.
All the latent variable network models discussed in Section 2 have latent error networks $\{\epsilon _{ij} \}_{ij}$ that are exchangeable, where we define $\epsilon _{ij} = f_{\boldsymbol{\theta }}(\textbf{v}_{i}, \textbf{v}_{j}) + \xi _{ij}$ from (2), the random portion of a general latent variable network model. Further, under constant mean $\mu _{ij} = \mu$, all the latent variable network models for the observed network $\{ y_{ij} \}_{ij}$ in Section 2 are exchangeable. In fact, any exchangeable network model may be represented by a latent variable network model. Specifically, the theory of exchangeable network models states that every exchangeable random network model may be represented in the following form (see, for example, Lovász & Szegedy, Reference Lovász and Szegedy2006; Kallenberg, Reference Kallenberg2006):
where the function $h \;:\; [0,1] \times [0,1] \rightarrow{\mathbb{R}}$ has finite integral $\int _{[0,1] \times [0,1]} h(u,v) du dv \lt \infty$ and serves to distinguish the various exchangeable network models. It can be shown that (7) is equivalent to the graphon representation of exchangeable random network models, where the graphon is the canonical probabilistic object of exchangeable random network models (Lovász & Szegedy, Reference Lovász and Szegedy2006; Borgs et al., Reference Borgs, Chayes, Cohn and Zhao2014). Noting that we may always map the random scalar $u_i$ to some random vector $\textbf{v}_i$ , the expression in (7) illustrates how every exchangeable random network model may be represented by a latent variable network model in the sense of (2).
3.1 Covariance matrices of exchangeable network models
The expression in (7) shows that any exchangeable network model for binary network data must correspond to a latent random network $\{\epsilon _{ij} \}_{ij}$ that is continuous and exchangeable. The covariance matrix of any undirected exchangeable network model has the same form and contains at most two unique nonzero values. (Marrs et al. (Reference Marrs, Fosdick and McCormick2017) show that directed exchangeable network models with continuous values all have covariance matrices of the same form with at most five unique nonzero terms). This fact can be seen by simply considering the ways that any pair of relations can share an actor. In addition to a variance, the remaining covariances are between relations that do and do not share an actor:
where the indices $i,j,k,$ and $l$ are unique. It is easy to see the second equality holds for any pair of relations that share an actor by the exchangeability property, that is, by permuting the actor labels. The third equality results from the fact that the only random elements in (7) are the actor random variables $u_i$, $u_j$, and the random error $\xi _{ij}$. When the random variables corresponding to two relations $\epsilon _{ij}$ and $\epsilon _{kl}$ share no actor, the pair of relations are independent by the generating process. Finally, we note that exchangeable network models have relations that are marginally identically distributed, and thus, relations therein have the same expectation and variance. That said, in the generalized linear regression case of (2), the means $\mu _{ij} ={\textbf{x}}^T_{ij}{\boldsymbol{\beta }}$ are non-constant, and thus, the observations $\{y_{ij} \}_{ij}$ are not exchangeable; only the latent error network $\{ \epsilon _{ij} \}_{ij}$ is exchangeable in the generalized linear regression case. In the proposed model, rather than put forth a particular parametric model for the latent network $\{ \epsilon _{ij} \}_{ij}$, we simply model the covariance structure outlined in (8), which is sufficient to represent the covariance structure of any exchangeable network model for the errors.
4. The probit exchangeable (PX) model
In this section, we propose the probit exchangeable network regression model, which we abbreviate as the “PX” model. In the PX model, the vectorized mean of the network is characterized by a linear combination of covariates, $\textbf{X}{\boldsymbol{\beta }}$ , where $\boldsymbol{\beta }$ is a $p$ -length vector of coefficients that are the subject of inference and $\textbf{X}$ is a $\binom{n}{2} \times p$ matrix of covariates. The excess network dependence beyond that captured in $\textbf{X}{\boldsymbol{\beta }}$ is represented by an unobservable mean-zero error vector $\boldsymbol{\epsilon }$ , a vectorization of $\{ \epsilon _{ij} \}_{ij}$ , that is exchangeable in the sense of (6). The PX model is
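(The following display is our reconstruction of (9), written in the latent-variable notation of Section 1.)

$z_{ij} = {\textbf{x}}_{ij}^T{\boldsymbol{\beta }} + \epsilon _{ij}, \qquad y_{ij} = \textbf{1}\left [ z_{ij} \gt 0 \right ], \qquad (9)$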
where we note that the variance of $\epsilon _{ij}$ is not identifiable, and thus, we choose $\textrm{var}[\epsilon _{ij}] = 1$ without loss of generality. We focus on normally distributed unobserved errors $\boldsymbol{\epsilon }$ in this paper; however, other common distributions, such as the logistic distribution, could be used. We note that the normal distribution assumption implies that (9) is a typical probit regression model, but with correlation among the observations due to network structure.
As discussed in Section 3, under the exchangeability assumption, the covariance matrix of the latent error network $\textrm{var}[\boldsymbol{\epsilon }] = \boldsymbol{\Omega }$ has at most two unique nonzero parameters. Taking $\textrm{var}[\epsilon _{ij}] = 1$ , the covariance matrix of $\boldsymbol{\epsilon }$ has a single parameter $\rho = \textrm{cov}[\epsilon _{ij}, \epsilon _{ik}]$ . We may thus write
where we define the binary matrices $\{ \mathcal{S}_{i} \}_{i=1}^3$ indicating unique entries in $\boldsymbol{\Omega }$. The matrix $\mathcal{S}_1$ is a diagonal matrix indicating the locations of the variances in $\boldsymbol{\Omega }$, and $\mathcal{S}_2$ and $\mathcal{S}_3$ indicate the locations in $\boldsymbol{\Omega }$ corresponding to the covariances $\textrm{cov}[\epsilon _{ij}, \epsilon _{ik}]$ and $\textrm{cov}[\epsilon _{ij}, \epsilon _{kl}]$, respectively, where the indices $i,j,k,$ and $l$ are unique.
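To make this structure concrete, the following R sketch (our own illustration, not the implementation used in this paper) builds $\mathcal{S}_1$, $\mathcal{S}_2$, $\mathcal{S}_3$, and $\boldsymbol{\Omega }(\rho ) = \mathcal{S}_1 + \rho \mathcal{S}_2$ for a small network and checks positive definiteness.

# Illustrative construction of the exchangeable covariance Omega(rho) for a small
# undirected network with n actors.
n   <- 5
rho <- 0.25
pairs <- t(combn(n, 2))             # the choose(n, 2) dyads, one per row
d     <- nrow(pairs)
S1 <- diag(d)                       # same-dyad (variance) positions
S2 <- matrix(0, d, d)               # dyads sharing exactly one actor
for (a in 1:(d - 1)) {
  for (b in (a + 1):d) {
    if (length(intersect(pairs[a, ], pairs[b, ])) == 1) S2[a, b] <- S2[b, a] <- 1
  }
}
S3 <- 1 - diag(d) - S2              # dyads sharing no actor
Omega <- S1 + rho * S2 + 0 * S3     # covariance matrix of the latent errors
all(eigen(Omega, symmetric = TRUE)$values > 0)  # positive definite when rho < 1/2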
The PX model unifies many of the latent variable network models discussed in Sections 2 and 3. Similar to (7), the PX model may be seen as representing the covariance structure of the latent variables $\{ f_{\boldsymbol{\theta }}(\textbf{v}_i, \textbf{v}_j) + \xi _{ij} \}_{ij}$ with $\{ \epsilon _{ij} \}_{ij}$, the unobservable error network of the PX model in (9). As both networks $\{ f_{\boldsymbol{\theta }}(\textbf{v}_i, \textbf{v}_j) + \xi _{ij} \}_{ij}$ and $\{ \epsilon _{ij} \}_{ij}$ are exchangeable, they have covariance matrices of the same form (see discussion in Section 3). As every exchangeable random network model may be represented by a latent variable network model, the PX model may represent the latent correlation structure of any exchangeable network model, yet without specifying a particular exchangeable model. Further, we now show that the PX model is equivalent to the social relations model under certain conditions.
Proposition 4.1. Suppose that the random effects $\{a_i \}_{i=1}^n$ for the social relations model in (3) are normally distributed. Then, there exists $\rho \in [0, 1/2]$ such that $\{y_{ij} \}_{ij}$ in the PX model in (9) is equal in distribution to $\{y_{ij} \}_{ij}$ as specified by the social relations model in (3).
Proof. As the PX and social relations models are probit regression models with the same mean structure, given by $\textbf{X}{\boldsymbol{\beta }}$, it is sufficient to show that their latent covariance matrices are equivalent, that is, that $\textrm{var}[ \{a_i + a_j + \xi _{ij} \}_{ij} ] = \textrm{var}[ \{ \epsilon _{ij} \}_{ij} ]$. By exchangeability, the latent covariance matrices of the PX and social relations models have the same form and by assumption have variance 1. It is easy to see that, given $\sigma ^2_a \le 1$ (a necessary condition for $\textrm{var}[\epsilon _{ij}]=1$), we may take $\rho = \sigma ^2_a/2$ for the PX model, which establishes equality in the model distributions.
Exact distributional equivalence between the PX model and latent variable models other than the social relations model will typically not hold. For example, the latent eigenmodel in (5) includes non-Gaussian random variables, so that exact distributional equivalence is impossible. Similarly, it appears likely that the general latent variable model in (2) may generate non-Gaussian random variables through the function $f_{\boldsymbol{\theta }} (\textbf{v}_i, \textbf{v}_j)$. Importantly, however, there does exist $\rho$ such that the covariance of the latent errors of every pair of relations, $\textrm{cov}[\epsilon _{ij}, \, \epsilon _{kl}]$, is equal to the covariance of the latent errors in any exchangeable latent variable model, $\textrm{cov}[f_{\boldsymbol{\theta }} (\textbf{v}_i, \textbf{v}_j) + \xi _{ij},\, f_{\boldsymbol{\theta }} (\textbf{v}_k, \textbf{v}_l) + \xi _{kl}]$. Hence, the PX model may be seen as a generalized exchangeable latent variable model that focuses all modeling effort on the first two moments of the data.
Proposition 4.1 states that the PX model and social relations model are equivalent under normality of their latent error networks. In principle, the social relations model is simply a generalized linear mixed model; however, existing software packages, such as lme4 in R (Bates et al., Reference Bates, Mächler, Bolker and Walker2015), do not appear to accommodate the random effects specification of the social relations model in (3) since the indices $i$ and $j$ pertain to random effects $a_i$ and $a_j$ from the same set (as opposed to $a_i$ and $b_j$ in a random crossed design). Nevertheless, the estimation scheme proposed in Section 5 employs the same strategies as those commonly used to estimate generalized linear mixed models (Littell et al., Reference Littell, Milliken, Stroup, Wolfinger and Oliver2006; Gelman & Hill, Reference Gelman and Hill2006). In the estimation algorithm in lme4, the marginal likelihood of the data is approximated and then maximized using numerical approximations with respect to $\boldsymbol{\beta }$ and random effects variance, for example $\sigma ^2_a$ in the social relations model. Rather than an approximate likelihood, we propose maximizing the true likelihood with respect to $\boldsymbol{\beta }$ and $\rho$ , yet also use numerical approximations to accomplish this maximization.
It is important to note that, although the latent errors $\{ \epsilon _{ij} \}_{ij}$ in the PX model form an exchangeable random network, the random network $y_{ij}$ represented by the PX model is almost certainly not exchangeable. For example, each $y_{ij}$ may have a different marginal expectation $\Phi \big({\textbf{x}}_{ij}^T{\boldsymbol{\beta }}\big)$ . Then, the relations in the network are not marginally identically distributed, which is a necessary condition for exchangeability. Further, the covariances between pairs of relations, say $y_{ij}$ and $y_{ik}$ , depend on the marginal expectations:
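(The following display is our reconstruction of this covariance, sketched from the probit structure.)

$\textrm{cov}[y_{ij}, y_{ik}] = \int \textbf{1}\left [ \epsilon _{ij} \gt -{\textbf{x}}_{ij}^T{\boldsymbol{\beta }} \right ] \textbf{1}\left [ \epsilon _{ik} \gt -{\textbf{x}}_{ik}^T{\boldsymbol{\beta }} \right ] dF_\rho - \Phi \left ({\textbf{x}}_{ij}^T{\boldsymbol{\beta }}\right ) \Phi \left ({\textbf{x}}_{ik}^T{\boldsymbol{\beta }}\right ).$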
Here, $dF_\rho$ is the bivariate standard normal distribution with correlation $\rho$ . Since the covariance $\textrm{cov}[y_{ij}, y_{ik} ]$ depends on the latent means ${\textbf{x}}^T_{ij}{\boldsymbol{\beta }}$ and ${\textbf{x}}^T_{ik}{\boldsymbol{\beta }}$ , $\textrm{cov}[y_{ij}, y_{ik} ]$ is only equal to $\textrm{cov}[y_{ab}, y_{ac} ]$ when the latent means are equal. As a result, although the covariance matrix of the unobserved errors $\boldsymbol{\Omega }$ is of a simple form with entries $\{1, \rho, 0 \}$ , the covariances between elements of the vector of observed relations $\textbf{y}$ are heterogeneous (in general) and depend on $\rho$ in a generally more complicated way.
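As an illustration of this dependence on the latent means, the following R sketch (our own, using the mvtnorm package) computes $\textrm{cov}[y_{ij}, y_{ik}]$ for two relations sharing an actor, given their latent means and the latent correlation $\rho$.

# Covariance between two binary relations sharing an actor, with latent means
# m1 = x_ij' beta and m2 = x_ik' beta and latent correlation rho.
library(mvtnorm)
cov_y <- function(m1, m2, rho) {
  joint <- pmvnorm(lower = c(-m1, -m2), upper = c(Inf, Inf),
                   corr = matrix(c(1, rho, rho, 1), 2, 2))   # P(y_ij = 1, y_ik = 1)
  as.numeric(joint) - pnorm(m1) * pnorm(m2)
}
cov_y(0.5, -0.2, 0.25)   # depends on both latent means, not only on rho
cov_y(1.5, -0.2, 0.25)   # a different pair of means gives a different covariance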
5. Estimation
In this section, we propose an estimator of $\{{\boldsymbol{\beta }}, \rho \}$ in the PX model that approximates the maximum likelihood estimator (MLE). The algorithm we propose is based on the EM algorithm (Dempster et al., Reference Dempster, Laird and Rubin1977). Although the covariance matrix for the PX model is highly structured, as in (10), a closed-form expression for the MLE does not appear available. While we explored pseudo-likelihood pairwise approximations (also called “composite likelihoods” in some literature) to the complete PX likelihood (Heagerty & Lele, Reference Heagerty and Lele1998), we found no substantial advantage, in either performance or runtime, over the proposed estimation scheme in this paper.
The proposed estimation algorithm consists of alternating computation of the expected complete likelihood with maximization with respect to $\rho$ and $\boldsymbol{\beta }$ , iterating until convergence. Since the algorithm iterates expectation and two maximization steps, we term it the EMM algorithm. To improve algorithm efficiency, we initialize $\boldsymbol{\beta }$ at the ordinary probit regression estimator (assuming independence of the latent errors), and initialize $\rho$ with a mixture estimator based on possible values of $\rho$ such that $\boldsymbol{\Omega }$ is positive definite, as detailed in Section A.1. The complete EMM algorithm is presented in Algorithm 1. In the following text, we detail the EMM algorithm, beginning with maximization with respect to $\rho$ , and then proceeding to maximization with respect to $\boldsymbol{\beta }$ . We define $\gamma _i = E\!\left[\boldsymbol{\epsilon }^T \mathcal{S}_i \boldsymbol{\epsilon } \mid \textbf{y},{\widehat{\rho }}^{(\nu )}, \widehat{{\boldsymbol{\beta }}}^{(\nu )} \right]/ |\Theta _i|$ , where $\Theta _i$ is the set of relation pairs indicated by binary matrices $\mathcal{S}_i$ . By default, we typically set $\tau = 10^{-2}$ and $\delta = 10^{-1}$ .
5.1 Expectation
Consider the log-likelihood, $\ell _{\textbf{z}}$ , of the latent continuous random vector $\textbf{z}$ . Taking the expectation of $\ell _{\textbf{z}}$ conditional on $\textbf{y}$ , the expectation step for a given iteration $\nu$ of the EM algorithm is
where ${\widehat{\rho }}^{(\nu )}$ and $\widehat{{\boldsymbol{\beta }}}^{(\nu )}$ are the estimators of $\rho$ and $\boldsymbol{\beta }$ at iteration $\nu$. In discussing the maximization step for $\rho$, we will show that the $\rho$ update depends on the data through the expectations denoted by $\gamma _i$ for $i \in \{1,2,3 \}$. In discussing the maximization step for $\boldsymbol{\beta }$, we will show that the $\boldsymbol{\beta }$ update depends on the data only through the expectation $E\!\left[\boldsymbol{\epsilon } \, | \, \textbf{y},{\widehat{\rho }}^{(\nu )}, \widehat{{\boldsymbol{\beta }}}^{(\nu )} \right]$.
5.1.1 Approximations
The computation of $E\!\left[\boldsymbol{\epsilon } \, | \, \textbf{y},{\widehat{\rho }}^{(\nu )}, \widehat{{\boldsymbol{\beta }}}^{(\nu )}\right]$ in (14) is nontrivial, as it is a $\binom{n}{2}$ -dimensional truncated multivariate normal integral. We exploit the structure of $\boldsymbol{\Omega }$ to compute $E\!\left[\boldsymbol{\epsilon } \, | \, \textbf{y},{\widehat{\rho }}^{(\nu )}, \widehat{{\boldsymbol{\beta }}}^{(\nu )}\right]$ using the law of total expectation. A Newton-Raphson algorithm, along with an approximate matrix inverse, is employed to compute an approximation of $E\!\left[\boldsymbol{\epsilon } \, | \, \textbf{y},{\widehat{\rho }}^{(\nu )}, \widehat{{\boldsymbol{\beta }}}^{(\nu )}\right]$ . Details of the implementation of the approximations are given in Section A.2.
The expectations $\{ \gamma _i \}_{i=1}^3$ require the computation of $\binom{n}{2}$ -dimensional truncated multivariate normal integrals, which are onerous for even small networks. Thus, we make two approximations to $\{ \gamma _i \}_{i=1}^3$ to reduce the runtime of the EMM algorithm. First, we compute the expectations conditioning only on the entries in $\textbf{y}$ that correspond to the entries in $\boldsymbol{\epsilon }$ being integrated, for example, instead of computing $E[\epsilon _{jk} \epsilon _{lm} \, | \, \textbf{y}]$ , we compute $E[\epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm}]$ . This first approximation is most appropriate when $\rho$ is small, since $y_{lm}$ is maximally informative for $\epsilon _{jk}$ when $\rho$ is large (for $l,m,j,$ and $k$ distinct). Second, we find empirically that $\gamma _2 = E\!\left[\boldsymbol{\epsilon }^T \mathcal{S}_2 \boldsymbol{\epsilon } \, | \, \textbf{y} \right]/ |\Theta _2|$ is approximately linear in $\rho$ , since this sample mean of conditional expectations concentrates around a linear function of $\rho$ . Thus, we compute $\gamma _2$ for $\rho =0$ and $\rho = 1$ and use a line connecting these two values to compute $\gamma _2$ for arbitrary values of $\rho$ (see evidence of linearity of $\gamma _2$ for the political books network in Appendix E). The details of the approximations to $\{ \gamma _i \}_{i=1}^3$ are given in Section A.3.
5.2 Maximization with respect to $\boldsymbol\rho$
To derive the maximization step for $\rho$ , we use the method of Lagrange multipliers, since differentiating (11) directly with respect to $\rho$ gives complex nonlinear equations that are not easily solvable. We first define the set of parameters $\{\phi _i \}_{i=1}^3$ , representing the variance and two possible covariances in $\boldsymbol{\Omega }$ ,
where the indices $i,j,k,$ and $l$ are distinct. In addition, we let $\textbf{p} = [p_1, p_2, p_3]$ parametrize the precision matrix $\boldsymbol{\Omega }^{-1} = \sum _{i=1}^3 p_i \mathcal{S}_i$ , which has the same form as the covariance matrix $\boldsymbol{\Omega }$ (see Marrs et al. (Reference Marrs, Fosdick and McCormick2017) for a similar result when $\{ \epsilon _{ij} \}_{ij}$ forms a directed network). The objective function, incorporating the restrictions that $\phi _1 = 1$ and $\phi _3 = 0$ , is
where $\boldsymbol{\phi } = [\phi _1, \phi _2, \phi _3]$ and the “ $\frac{1}{2}$ ” factors are included to simplify algebra. Then, differentiating $Q_{\textbf{y}}$ with respect to $\textbf{p}$ , $\lambda _1$ , and $\lambda _3$ , the estimators for $\rho$ , $\{ \lambda _1, \lambda _3 \}$ are
where again $ \Theta _i$ is the set of pairs of relations $(jk, lm)$ that share an actor in the $i^{\textrm{th}}$ manner, for $i \in \{1,2,3 \}$ . For instance, $\Theta _2$ consists of pairs of relations of the form $(jk, jl)$ , where $j,k,$ and $l$ are distinct indices. In (12) and (13), the partial derivatives $\left \{ \partial \phi _i/ \partial p_j \right \}$ are available in closed form and are easily computable in $O(1)$ time using the forms of $\boldsymbol{\Omega }$ and $\boldsymbol{\Omega }^{-1}$ . See Appendix B for details.
Alternation of the estimators for $\rho$ and $\{ \lambda _1, \lambda _3 \}$ in (12) and (13) constitutes a block coordinate descent for $\rho = \phi _2$ subject to the constraints $\phi _1 = 1$ and $\phi _3 = 0$ . This block coordinate descent makes up the maximization step of the EMM algorithm for $\rho$ .
5.3 Maximization with respect to $\beta$
The maximization step with respect to $\boldsymbol{\beta }$ in the EMM algorithm can be obtained directly. Setting the derivative of (11) with respect to $\boldsymbol{\beta }$ equal to zero, the maximization step for $\boldsymbol{\beta }$ is
where we use the identity $\boldsymbol{\epsilon } = \textbf{z} - \textbf{X}{\boldsymbol{\beta }}$ . In Appendix B, we show that the leading terms of the unique entries in $\Omega ^{-1}$ , $\textbf{p}$ , depend only on $\rho$ through a multiplicative factor,
Thus, we may factor $f(\rho )$ out of (14), and the $\boldsymbol{\beta }$ maximization is asymptotically $\rho$ -free (except for the expectation term). Similarly, the maximization with respect to $\rho$ in Section 5.2 is free from $\boldsymbol{\beta }$ except for the expectation term. Hence, only a single maximization step with respect to each $\boldsymbol{\beta }$ and $\rho$ is required for each expectation.
5.4 Consistency of the EMM estimator
The complete multivariate normal likelihood for $\textbf{z}$ is a non-curved, identifiable likelihood. It is known that each expectation and maximization step in an EM algorithm increases the current likelihood value (Wu, Reference Wu1983). Whenever there is a unique local maximum, the EM algorithm yields consistent and efficient estimators. We make a series of approximations to the expectations in the EMM algorithm to reduce computational demands, and thus the theory of EM estimator convergence may not be directly applicable. Yet, we find that the EMM estimators, $\{ \widehat{\beta }_{\textrm{EMM}}, \widehat{\rho }_{\textrm{EMM}} \}$, maintain consistency. Taking the leading terms of ${\widehat{\rho }}_{\textrm{EMM}}$,
which has expectation $E[{\widehat{\rho }}_{\textrm{EMM}}] = \rho + O(n^{-1})$ . Then, consistency can be established by showing that the variance of ${\widehat{\rho }}_{\textrm{EMM}}$ tends to zero. We provide details in Appendix C. The estimator $\widehat{\beta }_{\textrm{EMM}}$ is particularly difficult to analyze, as $E[\epsilon _{jk} \mid \textbf{y}]$ depends on every entry in $\textbf{y}$ , and because we approximate this expectation. We provide a sketch for a proof of consistency in Appendix C by bounding the distance between the EMM estimator for $\beta$ and the true MLE, $\left|\!\left|\widehat{\beta }_{\textrm{EMM}} - \widehat{\beta }_{\textrm{MLE}} \right|\right|_2^2$ , using an easier-to-analyze estimator for $\beta$ which replaces $E[\epsilon _{jk} \mid \textbf{y}]$ with $E[\epsilon _{jk} \mid y_{jk}]$ . As in the argument for consistency of ${\widehat{\rho }}_{\textrm{EMM}}$ , we establish consistency of the bounding estimator by showing the expectation is asymptotically equal to the true value of $\beta$ and that the variance of the bounding estimator tends to zero. We also discuss performance of $\widehat{\beta }_{\textrm{EMM}}$ under model misspecification, showing that it maintains consistency even under violation of the normality and exchangeability assumptions.
6. Prediction
In this section, we describe how to use the PX model, and the approximations in service of Algorithm 1, to make predictions for an unobserved network relation without undue computational cost. The predicted value we seek is the probability of observing $y_{jk} = 1$ given all the other values $\textbf{y}_{-jk}$ , where $\textbf{y}_{-jk}$ is the vector of observations $\textbf{y}$ excluding the single relation $jk$ . As in estimation, the desired probability is again equal to a $\binom{n}{2}$ -dimensional multivariate truncated normal integral, which is computationally burdensome. Thus, we approximate the desired prediction probability
The approximation in (15) is based on the fact that $[\epsilon _{jk} \, | \,{\boldsymbol{\epsilon }}_{-jk}]$ is normally distributed:
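(The following display is our reconstruction of (16), obtained from standard Gaussian conditioning with the precision parametrization $\boldsymbol{\Omega }^{-1} = \sum _{i=1}^3 p_i \mathcal{S}_i$ of Section 5.2; the exact form is a sketch rather than the authors' original display.)

$\left [ \epsilon _{jk} \, | \,{\boldsymbol{\epsilon }}_{-jk} \right ] \sim \textrm{N}\left ( m_{jk}, \; p_1^{-1} \right ), \qquad m_{jk} = -\frac{1}{p_1} \textbf{1}_{jk}^T \left ( p_2 \mathcal{S}_2 + p_3 \mathcal{S}_3 \right ) \widetilde{\boldsymbol{\epsilon }}_{-jk}, \qquad (16)$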
where $\textbf{1}_{jk}$ is the vector of all zeros with a one in the position corresponding to relation $jk$ and, for notational simplicity, we define $\widetilde{\boldsymbol{\epsilon }}_{-jk}$ as the vector $\boldsymbol{\epsilon }$ with a zero in the entry corresponding to relation $jk$. We note that the diagonal of the matrix $p_2 \mathcal{S}_2 + p_3 \mathcal{S}_3$ consists of all zeros so that $m_{jk}$ is free of $\epsilon _{jk}$. Then, the inner expectation in (15) is
Of course, $m_{jk}$ depends on ${\boldsymbol{\epsilon }}_{-jk}$ which is unknown, and thus, we replace $m_{jk}$ with its conditional expectation $E[m_{jk} \, | \, \textbf{y}_{-jk}] = E[\epsilon _{jk} \, | \, \textbf{y}_{-jk}]$ .
Computing $E[\epsilon _{jk} \, | \, \textbf{y}_{-jk}]$ is extremely difficult; however, computing $E[\epsilon _{jk} \, | \, \textbf{y}]$ proves feasible if we exploit the structure of $\boldsymbol{\Omega }$ . Thus, we approximate the desired expectation by imputing $y_{jk}$ with the mode of the observed data:
where $y^*$ is the mode of $\textbf{y}_{-jk}$ . The error due to this approximation is small and shrinks as $n$ grows. Substituting (18) for $m_{jk}$ in (17) gives the final expression in (15).
7. Simulation studies
In this section, we describe three simulation studies. The first evaluates the approximations made in Algorithm 1, verifying that the EMM estimator remains close to the maximum likelihood estimator and improves upon standard probit regression. The second examines consistency of the EMM estimator of $\boldsymbol{\beta }$ and compares its performance to that of the estimators of $\boldsymbol{\beta }$ from the social relations model and the latent eigenmodel. The third evaluates the robustness of the PX model and the EMM algorithm to violation of the assumption that the latent random variables are normally distributed.
In each simulation study, we generated data with a mean consisting of three covariates and an intercept:
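(The following display is our reconstruction of (19), based on the coefficient descriptions that follow.)

$y_{ij} = \textbf{1}\left [ \, \beta _0 + \beta _1 \textbf{1}\left [ i \in C \right ] \textbf{1}\left [ j \in C \right ] + \beta _2 \left | x_{2i} - x_{2j} \right | + \beta _3 \, x_{3ij} + \epsilon _{ij} \gt 0 \, \right ], \qquad (19)$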
In the model in (19), $\beta _0$ is an intercept; $\beta _1$ is a coefficient on a binary indicator of whether individuals $i$ and $j$ both belong to a pre-specified class $C$; $\beta _2$ is a coefficient on the absolute difference of a continuous, actor-specific covariate $x_{2i}$; and $\beta _3$ is the coefficient on a pair-specific continuous covariate $x_{3ij}$. We fixed ${\boldsymbol{\beta }} = [\beta _0, \beta _1, \beta _2, \beta _3]^T$ at a single set of values. Since the accuracy of estimators of $\boldsymbol{\beta }$ may depend on $\textbf{X}$, we generated $20$ random design matrices $\textbf{X}$ for each sample size of $n \in \{ 20, 40, 80 \}$ actors. We emphasize that, although these may appear to be only moderately sized networks, each consists of $\binom{n}{2} \in \{ 190, \, 780, \, 3160\}$ observations. For each design matrix, we simulated 100 error realizations of $\{ \epsilon _{ij} \}_{ij}$, with distributions that depended on the generating model. When generating from the PX model, half of the total variance in $\epsilon _{ij}$ was due to correlation $\rho = 1/4$, and the remaining half was due to the unit variance of $\epsilon _{ij}$. When generating from the latent eigenmodel in (5), one-third of the variance in $\epsilon _{ij}$ was due to each term $a_i + a_j$, $\textbf{u}_i^T \Lambda \textbf{u}_j$, and $\xi _{ij}$, respectively. For additional details of the simulation study procedures, see Section D.1.
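For concreteness, the following R sketch generates one design matrix $\textbf{X}$ of this form; the covariate distributions and the class-membership probability are our own illustrative choices and are not necessarily those used in the simulation study.

# Illustrative generation of one design matrix X in the spirit of (19).
set.seed(1)
n     <- 40
pairs <- t(combn(n, 2))                          # one row per dyad ij
inC   <- rbinom(n, 1, 0.5)                       # membership in the pre-specified class C (assumed)
x2i   <- rnorm(n)                                # actor-specific continuous covariate (assumed)
x1    <- inC[pairs[, 1]] * inC[pairs[, 2]]       # indicator that both actors belong to C
x2    <- abs(x2i[pairs[, 1]] - x2i[pairs[, 2]])  # absolute difference of actor covariates
x3    <- rnorm(nrow(pairs))                      # pair-specific continuous covariate (assumed)
X     <- cbind(intercept = 1, x1, x2, x3)        # the choose(n, 2) x 4 design matrix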
7.1 Evaluation of approximations in Algorithm 1
To evaluate the efficacy of the approximations described in the estimation procedure in Algorithm 1, we simulated from (19) for a single $\textbf{X}$ with $n=40$ (larger $n$ caused multivariate normal integral failures in R). We simulated 100 networks from the PX model in (9) using this $\textbf{X}$ , for each value of $\rho \in \{0.1, 0.25, 0.4 \}$ (we note that we require $\rho \lt 1/2$ for the error covariance matrix $\Omega$ to be positive definite). For each realization, we estimated $\beta$ in the PX model using EMM in Algorithm 1. To estimate $\beta$ in the standard probit model, we used the function glm in R. To compute the MLE, we numerically optimized the data log-likelihood using the Broyden-Fletcher-Goldfarb-Shanno algorithm as implemented in the optim function in R, initializing at the true values of $\{ \beta, \rho \}$ .
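For reference, the standard probit baseline can be obtained with glm as in the sketch below; the dyad-level data frame dyads and its column names are hypothetical stand-ins, and the PX likelihood passed to optim would need to be written by the user (indicated only in a comment).

# Standard probit fit assuming independent latent errors (illustrative).
fit_std <- glm(y ~ x1 + x2 + x3, family = binomial(link = "probit"), data = dyads)
coef(fit_std)   # standard probit estimate of beta
# The PX maximum likelihood comparison uses numerical optimization, e.g.,
# optim(par = start_values, fn = px_negloglik, method = "BFGS"),
# where px_negloglik is a user-written negative log-likelihood (not shown).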
In the left panel of Figure 1, we evaluate the performance of the EMM estimator by comparing the root mean square error (RMSE) between the EMM coefficient estimate, $\widehat{\beta }_{\textrm{EMM}}$, and the MLE obtained by the optimization procedure, $\widehat{\beta }_{\textrm{MLE}}$. As a baseline, we compute the RMSE between $\widehat{\beta }_{\textrm{MLE}}$ and the true value $\beta$. If the approximation errors in the EMM algorithm are small, we expect the RMSE between $\widehat{\beta }_{\textrm{EMM}}$ and $\widehat{\beta }_{\textrm{MLE}}$ to be much smaller than the RMSE between $\widehat{\beta }_{\textrm{MLE}}$ and $\beta$. Generally, the RMSE between $\widehat{\beta }_{\textrm{EMM}}$ and $\widehat{\beta }_{\textrm{MLE}}$ is smaller than the RMSE between $\widehat{\beta }_{\textrm{MLE}}$ and $\beta$. However, the discrepancy between the two RMSEs decreases as the true $\rho$ grows. As a reference, the RMSE between $\widehat{\beta }_{\textrm{Std. probit}}$ and $\widehat{\beta }_{\textrm{MLE}}$ is also shown in the left panel of Figure 1; the EMM estimator is closer to $\widehat{\beta }_{\textrm{MLE}}$ than the standard probit estimator is to $\widehat{\beta }_{\textrm{MLE}}$ for all values of $\rho$. Raw RMSE values between the estimators and the truth, shown in Figure 2, confirm that the EMM algorithm does perform better than standard probit in RMSE with respect to estimation of $\beta$. The results of this simulation study suggest that the EMM algorithm improves estimation of $\beta$ over the standard probit estimator for $\rho \gt 0$ and that the EMM estimator is reasonably close to the MLE, signifying that the approximations in the EMM algorithm are reasonable.
In the right panel of Figure 1, the EMM estimator of $\rho$ is closer to the MLE, ${\widehat{\rho }}_{\textrm{MLE}}$, than the MLE is to the true value of $\rho$ for all values of $\rho$ examined. This fact suggests that the approximation error in estimating $\rho$ in the EMM algorithm is small. Further, the raw RMSE values shown in Figure 2 illustrate that ${\widehat{\rho }}_{\textrm{EMM}}$ may be as good an estimator of $\rho$ as is ${\widehat{\rho }}_{\textrm{MLE}}$. The approximations in ${\widehat{\rho }}_{\textrm{EMM}}$ appear to be stable over the range of $\rho$ values examined. Overall, since the degradation in performance of the EMM algorithm is most pronounced in estimation of $\beta$, we postulate that the degradation may be due to the approximations in computing $E[\epsilon _{jk} \mid \textbf{y}]$ (see Section A.2).
7.2 Performance in estimation of $\beta$
To evaluate the performance of the PX estimator in estimating linear coefficients $\boldsymbol{\beta }$ , we compared estimates of $\boldsymbol{\beta }$ by the EMM algorithm to estimators of the social relations and latent eigenmodels on data generated from the PX model and data generated from the latent eigenmodel. We used the amen package in R to estimate the social relations model and latent eigenmodel (Hoff et al., Reference Hoff, Fosdick, Volfovsky and He2017). We again compared these estimators to the standard probit regression model assuming independence as a baseline, which we estimated using the function glm in R. We focused on the value of $\rho =0.25$ , in the center of the range of possible $\rho$ values.
In Figure 3, we plot the RMSE (scaled by $n^{1/2}$ ) of the $\beta$ coefficients estimated for the PX model, standard probit model, social relations model, and latent eigenmodels. We see that the EMM estimator for the PX model has a downward trend in $n^{1/2}$ RMSE with $n$ , and a reducing spread of $n^{1/2}$ RMSE with $n$ , for both the PX and latent eigenmodel generating models. These facts suggest that the PX estimator is consistent for $\boldsymbol{\beta }$ , at a rate $n^{1/2}$ or better, for both the PX and latent eigenmodel generating models, confirming the claims in Section 5.4. Further, the EMM estimator has the lowest median $n^{1/2}$ RMSE of any of the estimators for all entries in $\boldsymbol{\beta }$ , where $n^{1/2}$ RMSE is evaluated for each $\textbf{X}$ realization (across the error realizations), and the median is computed across the 20 $\textbf{X}$ realizations. We observe similar patterns for the correlation parameter $\rho$ ; see Section D.1. Interestingly, the superiority of the PX estimator holds whether we generate from the PX or latent eigenmodel, which suggests that any benefit in correctly specifying the latent eigenmodel is lost in the estimating routine. The larger $n^{1/2}$ RMSEs of the amen estimator of the social relations and latent eigenmodels are a result of bias; see Section D.1 for bias-variance decomposition of the MSEs.
7.3 Runtimes
We evaluated the average runtimes of the algorithms used to estimate the simulated data. The average runtimes are plotted in Figure 4. The improvement in runtime offered by the EMM estimation scheme over MCMC estimation of the social relations model and the latent eigenmodel is several orders of magnitude. Interestingly, the runtime cost of EMM appears to grow faster with $n$ than that of the MCMC routines and of standard probit regression. A contributing factor is the sum over $O(n^3)$ terms in the maximization of $\rho$ in the EMM algorithm. We have experimented with using only a random subset of $O(n^2)$ relation pairs in the maximization step, which results in gains in runtime with small cost in estimation performance. Such a tradeoff may become attractive for networks of sufficient size $n$.
7.4 Evaluation of latent normality assumption
To evaluate the performance of the PX model under violation of the normality assumption on the latent errors ${\{ \epsilon _{ij}\}}_{ij}$, we repeated the simulation study with $t$-distributed latent random variables. Specifically, we simulated from (19), but replaced the latent error vector $\boldsymbol{\epsilon }$ with $\sigma ^{-1} \Omega ^{1/2}{\textbf{u}}$, where $\textbf{u}$ consists of independently and identically distributed $t$ random variables with 5 degrees of freedom. The scaling factor $\sigma = \sqrt{5/3}$ ensures that $\sigma ^{-1}{\textbf{u}}$ has unit population variance, for consistency with the Gaussian case, and $\Omega ^{1/2}$ is the matrix square root of $\Omega$, with $\rho =0.25$. This model thus has the same latent mean and covariance matrix as in the Gaussian case, but the latent errors have substantially heavier tails.
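The following R sketch illustrates one way to construct these heavy-tailed latent errors; it assumes Omega has been built as in the earlier sketch (with $\rho = 0.25$) and that the design matrix X was generated for the same $n$, and it uses placeholder coefficient values.

# Heavy-tailed latent errors with the same covariance as the Gaussian case.
library(expm)                                # provides the matrix square root sqrtm()
beta  <- c(-1, 0.5, -0.5, 0.5)               # placeholder coefficients, not the paper's values
sigma <- sqrt(5 / 3)                         # standard deviation of a t_5 random variable
u     <- rt(nrow(Omega), df = 5)             # iid t errors with 5 degrees of freedom
eps   <- as.vector(sqrtm(Omega) %*% u) / sigma   # sigma^{-1} Omega^{1/2} u
y     <- as.integer(X %*% beta + eps > 0)        # binary relations under the mean model (19)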
The left panel of Figure 5 shows the performance of the EMM algorithm in estimating $\beta _1$, compared to the standard probit regression estimates. As in the Gaussian case, the EMM algorithm produces estimates with $n^{1/2}$ RMSE tending to zero as $n$ grows. Also as in the Gaussian case, EMM estimation of the PX model improves estimation of $\boldsymbol{\beta }$ over standard probit regression. We observed the same results in estimation of the remaining coefficients (see Section D.2). Unlike the Gaussian case, the $n^{1/2}$ RMSE in estimating $\rho$ did not appear to tend towards zero. However, in the right panel of Figure 5, the error in estimating $\rho$ scaled by $n^{1/4}$, the $n^{1/4}$ RMSE, does tend towards zero. This study confirms the claim in Section 5.4 that the EMM algorithm produces consistent estimators $\{ \widehat{{\boldsymbol{\beta }}},{\widehat{\rho }} \}$, even under violation of the normality assumption of the PX model.
8. Analysis of a network of political books
We live in a time of political polarization. We investigate this phenomenon by analyzing a network of $n=105$ books on American politics published around the time of the 2004 presidential election.Footnote 2 These data were compiled by Dr. Valdis Krebs using the “customers who bought this book also bought these books” list on Amazon.com. At the time, when browsing a particular book, Amazon listed the books that were bought by individuals who also bought the book in question. Thus, a relation between two books in the network indicates that they were frequently purchased by the same buyer on Amazon. Political books on the best-seller list of The New York Times were used as actors in the network. Finally, the books were labeled as conservative, liberal, or neutral based on each book’s description (Figure 6). Work by Dr. Krebs on a similar network was featured in a 2004 New York Times article (Eakin, Reference Eakin2004), where it was shown that there were many relations between books with similar ideologies yet relatively few across ideologies. The work by Dr. Krebs has inspired similar analyses of book purchasing networks in the fields of nanotechnology (Schummer, Reference Schummer2005) and climate science (Shi et al., Reference Shi, Shi, Dokshin, Evans and Macy2017).
To confirm previous work by Dr. Krebs, we develop a model that assigns a different probability of edge formation between books $i$ and $j$ depending on whether the books are ideologically aligned. By examining the network in Figure 6, we observe that neutral books appear to have a different number of ties than books that are labeled conservative or liberal. Thus, we add a nodal effect indicating whether either book in a relation is labeled neutral. The regression model specified is
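(The following display is our reconstruction of (20), based on the covariate descriptions here and the coefficient interpretations in the analysis below.)

$z_{ij} = \beta _0 + \beta _1 \textbf{1}\left [ c(i) = c(j) \right ] + \beta _2 \textbf{1}\left [ c(i) = \textrm{neutral} \textrm{ or } c(j) = \textrm{neutral} \right ] + \epsilon _{ij}, \qquad y_{ij} = \textbf{1}\left [ z_{ij} \gt 0 \right ], \qquad (20)$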
where $c(i)$ represents the class of book $i$ (neutral, conservative, or liberal), and the distribution and covariance matrix of $\boldsymbol{\epsilon }$ are determined by the particular model being estimated. In this section, we estimate the PX model (PX), the equivalent social relations model (SRM), the latent eigenmodel (LE), and, as a baseline, the standard probit regression model assuming independence of observations (which we label “std. probit”).
We used 10-fold cross-validation to compare the out-of-sample predictive performance of the estimators and the runtimes of the algorithms for the models in question. We used the proposed EMM algorithm to estimate the PX model, the amen package in R to estimate the social relations model and latent eigenmodel (Hoff et al., Reference Hoff, Fosdick, Volfovsky and He2017), and the glm(.) command in the R package stats to estimate the standard probit model. We randomly divided the $\binom{105}{2}$ relations into 10 disjoint sets, termed “folds,” of roughly the same size. Then, for each fold, we estimated the models on the remaining nine folds and made predictions for the data in the fold that was not used for estimation (for details of estimation of the PX model with missing data, see Section A.4). Repeating this operation for each fold yielded a complete data set of out-of-sample predictions for each estimating model. The procedure to make marginal predictions from the PX model is described in Section 6. To compare with the PX model, we make marginal predictions from the social relations model and the latent eigenmodel, that is, by integrating over the random effect space. The predictions from the social relations model and the latent eigenmodel are automatically output from amen in the presence of missing data. The predictions from the standard probit model are marginal by default as there is no correlation structure.
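The fold construction can be sketched in R as follows (an illustration only; the model fits and predictions inside the loop are omitted).

# Assign the choose(105, 2) book pairs to 10 roughly equal-sized folds.
set.seed(1)
n_books <- 105
d       <- choose(n_books, 2)
fold    <- sample(rep(1:10, length.out = d))
for (k in 1:10) {
  test  <- which(fold == k)    # dyads held out and predicted
  train <- which(fold != k)    # dyads used to fit PX, SRM, LE, and standard probit
  # ... fit each model on 'train' and predict the relations in 'test' ...
}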
We use area under the precision-recall curve (PRAUC) to measure performance of the predictions relative to the observed data, although using area under the receiver operating characteristic curve (ROC AUC) yields the same conclusions (see Appendix E). In Figure 6, the proposed EMM estimator produces an improvement in PRAUC over standard probit prediction that is roughly equivalent to the improvement of the social relations model over standard probit, yet with an average runtime that is 45 times faster (about a minute compared with an hour). The latent eigenmodel produces an improvement in PRAUC over the proposed EMM algorithm and the social relations model, however at the expense of a significant increase in average runtime: it is about 3,000 times slower than EMM, taking almost three days to complete. Note that we selected the number of MCMC iterations for the social relations and latent eigenmodels so that the sets of posterior samples (after burn-in) had effective sample sizes roughly equal to 100 independent samples of the $\boldsymbol{\beta }$ parameters. Increasing the number of iterations, which may be desirable, would result in even longer runtimes for the estimators of the social relations and latent eigenmodels. Taken together, the results of the cross-validation study suggest that the PX model accounts for a large portion of the correlation in network data with an estimation runtime that, depending upon the stopping criterion, is orders of magnitude faster than that of existing approaches.
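For reference, PRAUC can be computed from the out-of-sample predictions in a few lines of base R; the sketch below uses average precision, one common estimator of the area under the precision-recall curve (packages such as PRROC provide alternatives).

# Average precision: area under the precision-recall curve from predicted
# probabilities "score" and observed 0/1 outcomes "y".
prauc <- function(score, y) {
  ord  <- order(score, decreasing = TRUE)
  y    <- y[ord]
  tp   <- cumsum(y)                  # true positives as the threshold is lowered
  prec <- tp / seq_along(y)          # precision at each cutoff
  rec  <- tp / sum(y)                # recall at each cutoff
  sum(prec * diff(c(0, rec)))        # precision summed over recall increments
}
# prauc(pred, dyads$y)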
To fit the mean model in (20) to the complete data set, we used the EMM algorithm for the PX model and the amen package for the social relations model (SRM) and the latent eigenmodel (LE), which we ran for $1\times 10^6$ iterations after a burn-in of $5 \times 10^4$ iterations (with runtimes of roughly two hours for SRM and 17 hours for LE). The coefficient estimates in Table 1 suggest that books that share the same ideology are more likely to be frequently purchased together, as the estimates satisfy $\widehat{\beta }_1 \gt 0$ for every model. This positive coefficient estimate demonstrates political polarization in the network: conservative books are more likely to be purchased with other conservative books than with liberal books. The second coefficient estimate, $\widehat{\beta }_2 \gt 0$, suggests that, relative to a random pair of ideologically misaligned books, pairs of books where at least one of the books is neutral are more likely to be purchased together. Neutral books are thus generally more likely to be purchased with books of disparate ideologies and have a unifying effect in the book network. Returning briefly to Table 1, the runtimes highlight that EMM reduces the computational burden by orders of magnitude relative to existing approaches.
9. Discussion
In this paper, we present the PX model, a probit regression model for undirected, binary networks. The PX model adds a single parameter—latent correlation $\rho$ —to the ordinary probit regression model that assumes independence of observations. Our focus in this paper is estimation of the effects of exogenous covariates on the observed network, $\boldsymbol{\beta }$, and prediction of unobserved network relations. Thus, we do not present uncertainty estimators for $\widehat{{\boldsymbol{\beta }}}$ or $\widehat{\rho }$. However, practitioners estimating the PX model may require uncertainty estimators to perform inference. Development and evaluation of uncertainty estimators for network data is nontrivial; indeed, entire papers are dedicated to this task for the simpler linear regression case (see, for example, Aronow et al., Reference Aronow, Samii and Assenova2015; Marrs et al., Reference Marrs, Fosdick and McCormick2017). Future development of uncertainty estimators for the PX model may draw upon existing literature for uncertainty in EM estimators (Louis, Reference Louis1982) and the numerical approximations in this paper.
A popular notion in the analysis of network data is the presence of higher-order dependencies, meaning beyond second order (Hoff, Reference Hoff2005). The representation of triadic closure, a form of transitivity—the friend of my friend is likely to also be my friend— is one motivation for the latent eigenmodel (Hoff, Reference Hoff2008). The PX model does represent triadic closure to a degree. One can show that, given two edges of a triangle relation exist, $y_{ij} = y_{jk} = 1$, the probability that the third edge exists, ${\mathbb{P}}(y_{ik} = 1)$, increases as $\rho$ increases. However, the increase in probability describing triadic closure under the PX model is fixed based on the estimated value of $\rho$, which is informed only by the first two moments of the data when using the EMM estimator. It may be desirable to develop a test for whether the PX model sufficiently represents the level of triadic closure as suggested by the data. One such test might compute the empirical estimate of ${\mathbb{P}}(y_{ik} = 1 \, | \, y_{ij} = y_{jk} = 1)$ and compare this statistic to its distribution under the null that the PX model is the true model with correlation parameter $\rho ={\widehat{\rho }}$. Future work consists of theoretical development of the distributions of the test statistic(s) of choice under the null. Statistics of interest will likely be related to various clustering coefficients in the network literature (Wasserman & Faust, Reference Wasserman and Faust1994; Watts & Strogatz, Reference Watts and Strogatz1998).
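As a hedged illustration of the statistic such a test would use, the empirical version of ${\mathbb{P}}(y_{ik} = 1 \, | \, y_{ij} = y_{jk} = 1)$ (the global clustering coefficient) can be computed directly from the adjacency matrix:

# Empirical estimate of P(y_ik = 1 | y_ij = y_jk = 1) from a symmetric 0/1
# adjacency matrix Y with zero diagonal.
empirical_closure <- function(Y) {
  diag(Y) <- 0
  Y2 <- Y %*% Y                          # Y2[i, k] counts two-paths i - j - k
  twopaths <- sum(Y2) - sum(diag(Y2))    # exclude the i = k "paths"
  closed   <- sum(Y2 * Y)                # two-paths whose endpoints are also tied
  closed / twopaths
}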
We focus on the probit model in this paper. However, we find that this choice may limit the degree of covariance in the observed network $\{ y_{ij} \}_{ij}$ that the PX model can represent. For constant mean ${\textbf{x}}_{ij}^T{\boldsymbol{\beta }} = \mu$ , the maximum covariance the PX model can represent is bounded by
where $dF_\rho$ is the bivariate standard normal distribution with correlation $\rho$ . The use of different latent distributions for $\boldsymbol{\epsilon }$ other than normal may allow a model analogous to the PX model to represent a larger range of observed covariances $\textrm{cov}[y_{ij}, y_{ik}]$ . Future work may consider a logistic distribution for $\boldsymbol{\epsilon }$ , as some researchers prefer to make inference with logistic regression models for binary data due to the ease of interpretation.
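To make the bound concrete under our reading of it, the covariance between two relations sharing an actor under the PX model with constant mean $\mu$ is ${\mathbb{P}}(\epsilon _{ij} \gt -\mu, \epsilon _{ik} \gt -\mu ) - \Phi (\mu )^2$, which is largest at the boundary $\rho = 1/2$. A Monte Carlo sketch of this quantity (an illustration, not a formula taken from the paper) is:

# Monte Carlo sketch of cov[y_ij, y_ik] under the PX model with constant mean mu,
# evaluated at the largest admissible latent correlation rho = 1/2.
px_cov <- function(mu, rho = 0.5, nsim = 1e6) {
  e1 <- rnorm(nsim)
  e2 <- rho * e1 + sqrt(1 - rho^2) * rnorm(nsim)   # correlated standard normals
  y1 <- as.numeric(e1 > -mu)
  y2 <- as.numeric(e2 > -mu)
  mean(y1 * y2) - pnorm(mu)^2
}
px_cov(mu = 0)   # approximately 1/12 (about 0.083) at mu = 0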
Acknowledgements
This work utilized the RMACC Summit supercomputer, which is supported by the National Science Foundation (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University. The RMACC Summit supercomputer is a joint effort of the University of Colorado Boulder and Colorado State University. This work was also partially supported by the National Science Foundation under Grant no. 1856229.
Competing interests
None.
Supplementary materials
For supplementary material for this article, please visit http://doi.org/10.1017/nws.2023.12.
A. Details of estimation
In this section, we supply details of estimation in support of Algorithm 1, beginning with the initialization of $\rho$. We then provide details of computing the expectations of $\ell _{\textbf{y}}$ needed for the $\boldsymbol{\beta }$ maximization and for the $\rho$ maximization. We close the section with the handling of missing data in the EMM algorithm.
A.1 Initialization of $\rho$ estimator
An EM algorithm may take many iterations to converge, and selecting a starting point near the optimum may significantly reduce the number of iterations required. We present a method of initializing ${\widehat{\rho }}^{(0)}$ using a mixture estimator. By examining the eigenvalues of $\boldsymbol{\Omega }$, it can be shown that $\rho$ lies in the interval $[0, 1/2)$ whenever $\boldsymbol{\Omega }$ is positive definite for arbitrary $n$ (Marrs et al., Reference Marrs, Fosdick and McCormick2017). Thus, ${\widehat{\rho }} = 0.25$ is a natural naive initialization point, as it is the midpoint of the range of possible values. However, we also allow the data to influence the initialization point by taking a random subset $\mathcal{A}$ of $\Theta _2$ of size $2n^2$ and estimating $\rho$ using the data corresponding to relations in $\mathcal{A}$. The final initialization point is then a mixture of the naive value ${\widehat{\rho }} = 0.25$ and the estimate based on the data. We weight the naive value as if it arose from $100n$ samples, such that the weights are even at $n=50$ and, for increasing $n$, the data-based estimate dominates:
We compute the average $\frac{1}{|\mathcal{A}|}\sum _{jk,lm \in \mathcal{A}} E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$ using the linearization approach described in Section A.3.
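Under our reading of this weighting (the naive value treated as if it came from $100n$ samples and the data-based average from the $|\mathcal{A}| = 2n^2$ sampled pairs), the initialization can be written as the short sketch below.

# Sketch of the initialization of rho: a weighted mixture of the naive midpoint
# 0.25 and a data-based estimate rho_data computed from |A| = 2 n^2 sampled pairs.
init_rho <- function(rho_data, n) {
  w_naive <- 100 * n     # naive value weighted as if it arose from 100 n samples
  w_data  <- 2 * n^2     # number of sampled pairs in A
  (w_naive * 0.25 + w_data * rho_data) / (w_naive + w_data)
}
# init_rho(rho_data = 0.10, n = 105)   # the data-based estimate dominates once n > 50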
A.2 Implementation of $\beta$ expectation step
Under general correlation structure, computation of the expectation $E[\boldsymbol{\epsilon } \, | \, \textbf{y}]$ (step 1 in Algorithm 1, where we drop conditioning on $\rho ^{(\nu )}$ and ${\boldsymbol{\beta }}^{(\nu )}$ to lighten notation) is prohibitive even for small networks, since this expectation is a $\binom{n}{2}$-dimensional truncated multivariate normal integral. We exploit the structure of $\boldsymbol{\Omega }$ to compute $E[\boldsymbol{\epsilon } \, | \, \textbf{y}]$ using the law of total expectation and a Newton-Raphson algorithm.
First, we take a single relation $jk$ and use the law of total expectation to write
where ${\boldsymbol{\epsilon }}_{-jk}$ is the vector of all entries in $\boldsymbol{\epsilon }$ except relation $jk$. Beginning with the innermost conditional expectation, the distribution of $\epsilon _{jk}$ given ${\boldsymbol{\epsilon }}_{-jk}$ and $y_{jk}$ is truncated univariate normal, where the untruncated normal random variable has the mean and variance of $\epsilon _{jk}$ given ${\boldsymbol{\epsilon }}_{-jk}$. Based on the conditional multivariate normal distribution and the form of the inverse covariance matrix $\boldsymbol{\Omega }^{-1} = \sum _{i=1}^3 p_i \mathcal{S}_i$, we may write the untruncated distribution directly as
where $\textbf{1}_{jk}$ is the vector of all zeros with a one in the position corresponding to relation $jk$ and, for notational purposes, we define $\widetilde{\boldsymbol{\epsilon }}_{-jk}$ as the vector $\boldsymbol{\epsilon }$ except with a zero in the location corresponding to relation $jk$ . We note that the diagonal of the matrix $p_2 \mathcal{S}_2 + p_3 \mathcal{S}_3$ consists of all zeros so that $\mu _{jk}$ is free of $\epsilon _{jk}$ .
We now condition on $y_{jk}$ . For general $z \sim \textrm{N}(\mu, \sigma ^2)$ and $y = \mathbb{1}{[z \gt -\eta ]}$ , we have that
where $\widetilde{\eta } \;:\!=\; (\eta + \mu )/ \sigma$ . Now, taking $z = ( \epsilon _{jk} \, | \,{\boldsymbol{\epsilon }}_{-jk} )$ , we have that
where $\widetilde{\mu }_{jk} : = (\mu _{jk}+{\textbf{x}}^T_{jk}{\boldsymbol{\beta }})/ \sigma _n$ .
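As a quick check on this step, the conditional-mean formula implied above (matching the weight $v(\cdot)$ defined later in this subsection) can be verified by simulation; the sketch assumes $E[z \, | \, y] = \mu + \sigma \, \phi (\widetilde{\eta })(y - \Phi (\widetilde{\eta }))/\{\Phi (\widetilde{\eta })(1-\Phi (\widetilde{\eta }))\}$.

# Monte Carlo check of E[z | y] for z ~ N(mu, sigma^2) and y = 1(z > -eta).
mu <- 0.3; sigma <- 1.2; eta <- 0.5
et <- (eta + mu) / sigma
z  <- rnorm(1e6, mu, sigma)
y  <- as.numeric(z > -eta)
closed_form <- mu + sigma * dnorm(et) * (c(0, 1) - pnorm(et)) /
  (pnorm(et) * (1 - pnorm(et)))
monte_carlo <- c(mean(z[y == 0]), mean(z[y == 1]))
rbind(closed_form, monte_carlo)   # the two rows should agree closely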
We now turn to the outermost conditional expectation in (A2). Substituting the expression for $\mu _{jk}$ into (A5), we have that
This last conditional expectation is difficult to compute in general. Thus, in place of $\widetilde{\mu }_{lm}$ , we substitute its conditional expectation $E[\widetilde{\mu }_{lm} \, | \, \textbf{y} ]$ . Letting $w_{lm} \;:\!=\; E[\epsilon _{lm}\, | \, \textbf{y} ]$ and $\textbf{w}$ be the vector of the expectations $\{ w_{lm} \}_{lm}$ , we define the following nonlinear equation for $\textbf{w}$ :
where we define $\textbf{B} \;:\!=\; -{\sigma _n^2} \!\left ( p_2 \mathcal{S}_2 + p_3 \mathcal{S}_3 \right )$ , $\widetilde{{\textbf{w}}} : = ( \textbf{B}{\textbf{w}} + \textbf{X}{\boldsymbol{\beta }})/ \sigma _n$ , and the functions $\phi (.)$ and $\Phi (.)$ are applied element-wise. The approximation in (A7) refers to the approximation made when replacing $\widetilde{\mu }_{jk}$ with its conditional expectation $E[\widetilde{\mu }_{jk} | \textbf{y} ]$ . We use a Newton-Raphson algorithm to update $\textbf{w}$ (Atkinson, Reference Atkinson2008), initializing the algorithm using the expectation when $\rho =0$ ,
The Newton-Raphson algorithm re-estimates $\textbf{w}$ based on the estimate at iteration $\nu$ , $\widehat{{\textbf{w}}}^{(\nu )}$ , until convergence:
The inverse in (A9) is of a matrix that is not of the form $\sum _{i=1}^3 a_i \mathcal{S}_i$ . To reduce the computational burden of the Newton method updates, we numerically approximate the inverse in (A9). First, we define $v(w_{jk}) = \sigma _n \frac{\phi ( w_{jk} ) ( y_{jk} - \Phi (w_{jk}) ) }{\Phi (w_{jk}) (1 - \Phi (w_{jk}) )}$ , where we define the vector $v({\textbf{w}}) = \{ v(w_{jk}) \}_{jk}$ , and write the derivative
where we define
where we let $\phi _{jk} = \phi (w_{jk})$ and $\Phi _{jk} = \Phi (w_{jk})$ . The term $\textbf{D}\textbf{B}$ arises from differentiating $v({\textbf{w}})$ with respect to $\textbf{w}$ . Using the expression in (A10), we are then able to write the second term in (A9) as
We notice that the matrix $\textbf{I} + \textbf{D}$ is diagonal, but not homogeneous (were it homogeneous, we could compute (A11) directly, with limited computational burden, by exploiting the exchangeable structure). Instead, defining $\textbf{Q} = (1 + \delta )\textbf{I} - \textbf{B}^{-1}$ and $\textbf{M} = \textbf{D} - \delta \textbf{I}$, which is diagonal, we make the approximation that
which is based on a Neumann series of matrices and relies on the absolute eigenvalues of $\textbf{M}$ being small (Petersen et al., Reference Petersen and Pedersen2008). We choose $\delta$ to be the mean of the minimum and maximum value of $\textbf{D}$ . This choice of $\delta$ minimizes the maximum absolute eigenvalue of $\textbf{M}$ and thus limits the approximation error. Since the inverse of $\textbf{Q}$ may be computed using the exchangeable inversion formula discussed in Appendix B (in $O(1)$ time), the following approximation represents an improvement in computation from $O(n^3)$ to $O(n^2)$ time:
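A small numerical illustration of this Neumann-series step is given below; generic well-conditioned matrices stand in for $\textbf{Q}$ and $\textbf{D}$, since only the sizes of the eigenvalues of $\textbf{M}$ matter for the quality of the one-term expansion.

# One-term Neumann approximation (Q + M)^{-1} ~ Q^{-1} - Q^{-1} M Q^{-1},
# with M = D - delta*I and delta the midpoint of the diagonal entries of D.
set.seed(2)
p <- 50
A <- matrix(rnorm(p^2), p) / sqrt(p)
Q <- diag(p) + crossprod(A) / 4        # generic stand-in for Q (well conditioned)
d <- runif(p, 0.10, 0.20)              # stand-in for the diagonal of D
delta <- (min(d) + max(d)) / 2         # midpoint choice described in the text
M <- diag(d - delta)                   # diagonal matrix with small entries

exact  <- solve(Q + M)
Qinv   <- solve(Q)
approx <- Qinv - Qinv %*% M %*% Qinv   # one-term Neumann expansion
max(abs(exact - approx)) / max(abs(exact))   # small relative error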
A.3 Approximation to $\rho$ expectation step
The maximization of the expected likelihood with respect to $\rho$ relies on the computation of $\gamma _i = E[\boldsymbol{\epsilon }^T \mathcal{S}_{i} \boldsymbol{\epsilon } \, | \, \textbf{y} ]/ |\Theta _i|$ , for $i \in \{ 1,2,3\}$ (step 2 in Algorithm 1). Under general correlation structure, computation of the expectation $\{ \gamma _i \}_{i=1}^3$ for even small networks is prohibitive. To practically compute $\{ \gamma _{i} \}_{i=1}^3$ , we make two approximations, which we detail in the following subsections: (1) compute expectations conditioning only on the entries in $\textbf{y}$ that correspond to the entries in $\boldsymbol{\epsilon }$ being integrated and (2) approximating these pairwise expectations as linear functions of $\rho$ .
A.3.1 Pairwise expectation
Explicitly, the pairwise approximations to $\{ \gamma _{i} \}_{i=1}^3$ we make are as follows:
where $\Theta _i$ is the set of ordered pairs of relations $(jk, lm)$ that correspond to entries of $\mathcal{S}_{i}$ equal to 1, for $i \in \{1,2,3 \}$. These approximations are natural first-order approximations: recalling that $y_{jk} = \mathbb{1}\!\left[\epsilon _{jk} \gt -{\textbf{x}}^T_{jk}{\boldsymbol{\beta }}\right]$, the approximations in (A12) are based on the notion that knowing the domains of $\epsilon _{jk}$ and $\epsilon _{lm}$ is significantly more informative for $E[ \epsilon _{jk}\epsilon _{lm} \, | \, \textbf{y} ]$ than knowing the domain of, for example, $\epsilon _{ab}$.
The approximations in (A12) are orders of magnitude faster to compute than the expectations when conditioning on all observations $E[ \epsilon _{jk}\epsilon _{lm} \, | \, \textbf{y} ]$ . In particular, when $i \in \{ 1, 3\}$ , the expectations are available in closed form:
where we define $\eta _{jk} ={\textbf{x}}^T_{jk}{\boldsymbol{\beta }}$ and the indices $j,k,l$, and $m$ are distinct. When $i=2$, that is, $ | \{ j, k \} \cap \{l, m \} | = 1$, the expectation depends on a two-dimensional normal probability integral:
where $\bar{\eta }_{jk} = (2y_{jk} - 1)\eta _{jk}$ , for example, and $\bar{\rho } = (2y_{jk} - 1)(2y_{lm} - 1)\rho$ .
A.3.2 Linearization
The computation of $E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$ in (A13) requires the computation of $O(n^3)$ bivariate truncated normal integrals $L_{jk,lm}$, which are not generally available in closed form. We observe empirically, however, that the pairwise approximation to $\gamma _2$ described in Section A.3.1 above, $\gamma _2 \approx \frac{1}{|\Theta _2|}\sum _{jk, lm \in \Theta _2} E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$, is approximately linear in $\rho$. This linearity is somewhat intuitive: the sample mean has expectation equal to $\rho$, and is thus an asymptotically linear function of $\rho$; as the sample mean concentrates around its expectation, it concentrates around a linear function of $\rho$, and it is reasonable to approximate it by a linear function of $\rho$. To do so, we compute the approximate values of $\gamma _2$ at $\rho =0$ and at $\rho =1$. In particular,
To compute $c_2$, we must compute the value of $E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$ when $\rho = 1$. Computing $E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$ is simple when $y_{jk} = y_{lm}$, as in this case $E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ] = E\!\left[ \epsilon _{jk}^2 \, | \, y_{jk} = y_{lm}\right ]$ since, when $\rho = 1$, $\epsilon _{jk} = \epsilon _{lm}$. Approximations must be made in the cases when $y_{jk} \neq y_{lm}$. There are two such cases. In the first, there is overlap between the domains of $\epsilon _{jk}$ and $\epsilon _{lm}$ indicated by $y_{jk} = \mathbb{1}[\epsilon _{jk} \gt -\eta _{jk} ]$ and $y_{lm} = \mathbb{1}[\epsilon _{lm} \gt -\eta _{lm}]$, respectively. We define the domain for $\epsilon _{jk}$ indicated by $y_{jk}$ as $U_{jk} \;:\!=\; \{ u \in{\mathbb{R}} : u \gt (1 - 2y_{jk})\eta _{jk} \}$. As an example, there is overlap between $U_{jk}$ and $U_{lm}$ when $y_{jk} = 1, y_{lm} = 0$, and $\eta _{lm} \lt \eta _{jk}$. Then, the desired expectation may be approximated as $E[ \epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ] \approx E\!\left[ \epsilon _{jk}^2 \, | \, \epsilon _{jk} \in U_{jk} \cap U_{lm}\right ]$. In the second case, when $y_{jk} \neq y_{lm}$ and $U_{jk} \cap U_{lm} = \varnothing$, we make the approximation by integrating over the sets $U_{jk}$ and $U_{lm}$. That is, by taking
To summarize, we compute $c_2$ in (A14) when $\rho =1$ by using the following approximation to $E[\epsilon _{jk} \epsilon _{lm} \, | \, \textbf{y} ] \, \Big |_{\rho =1}$ :
A.4 Missing data
In this subsection, we describe estimation of the PX model in the presence of missing data. We present the maximization of $\ell _{\textbf{y}}$ with respect to $\boldsymbol{\beta }$ first. Second, we discuss maximization of $\ell _{\textbf{y}}$ with respect to $\rho$ . Finally, we give a note on prediction from the PX model when data are missing.
A.4.1 Update $\boldsymbol{\beta }$
To maximize $\ell _{\textbf{y}}$ with respect to $\boldsymbol{\beta }$ (Step 1 of Algorithm 1) in the presence of missing data, we impute the missing values of $\textbf{X}$ and $\textbf{y}$ . We make the decision to impute missing values since much of the speed of estimation of the PX model relies on exploitation of the particular network structure, and, when data are missing, this structure is more difficult to leverage. We impute entries in $\textbf{X}$ with the mean value of the covariates. For example, if $x_{jk}^{(1)}$ is missing, we replace it with the sample mean $\frac{1}{|\mathcal{M}^c|}\sum _{lm \in \mathcal{M}^c} x_{lm}^{(1)}$ , where the superscript $(1)$ refers to the first entry in ${\textbf{x}}_{jk}$ and $\mathcal{M}$ is the set of relations for which data are missing. If $y_{jk}$ is missing, we impute $y_{jk}$ with $\mathbb{1}[w_{jk} \gt -\bar{\eta }]$ , where $\bar{\eta } = \frac{1}{|\mathcal{M}^c|}\sum _{lm \in \mathcal{M}^c}{\textbf{x}}_{lm}^T \widehat{{\boldsymbol{\beta }}}$ and we compute ${\textbf{w}} = E[\boldsymbol{\epsilon } \, | \, \textbf{y}]$ using the procedure in Section A.2. We initialize this procedure at ${\textbf{w}}^{(0)}$ , where any missing entries $jk \in \mathcal{M}$ are initialized with $w^{(0)}_{jk} = 0$ . Given the imputed $\textbf{X}$ and $\textbf{y}$ , the estimation routine may be accomplished as described in Algorithm 1.
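The sketch below illustrates this imputation step; X, y, w, and beta_hat are hypothetical stand-ins for the design matrix (one row per relation), the outcome vector with NAs for missing relations, the current value of $E[\boldsymbol{\epsilon } \, | \, \textbf{y}]$, and the current coefficient estimate.

# Imputation used before the beta update: column means for missing covariates,
# and a thresholded latent expectation for missing outcomes.
impute_for_beta <- function(X, y, w, beta_hat) {
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    if (any(miss)) X[miss, j] <- mean(X[!miss, j])     # mean-impute covariates
  }
  obs     <- !is.na(y)
  eta_bar <- mean(X[obs, , drop = FALSE] %*% beta_hat) # average linear predictor
  y[!obs] <- as.numeric(w[!obs] > -eta_bar)            # impute y via w = E[eps | y]
  list(X = X, y = y)
}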
A.4.2 Update $\rho$
To maximize $\ell _{\textbf{y}}$ with respect to $\rho$ (Step 2 of Algorithm 1), we approximate $\{ \gamma _i \}_{i=1}^3$ using only observed values. Using the pairwise expressions in (A12), the expressions for the expectation step under missing data are
where we only subsample pairs of relations that are observed such that $\mathcal{A}^{(s)} \subset \Theta _2 \cap \mathcal{M}^c$ . Then, given the values of $\{ \gamma _i \}_{i=1}^3$ in (A15), the maximization of $\ell _{\textbf{y}}$ with respect to $\rho$ (Step 2 in Algorithm 1) may proceed as usual.
A.4.3 Prediction
Joint prediction in the presence of missing data is required for out-of-sample evaluation of the EMM estimator, for example, in the cross-validation study in Section 8. In this setting, model estimation is accomplished by imputing values in $\textbf{X}$ and $\textbf{y}$ as described earlier in this section under the “Update $\boldsymbol{\beta }$” subheading. Then, prediction may be performed by proceeding as described in Section 6 with the completed $\textbf{X}$ matrix and the imputed values of the missing entries in $\textbf{y}$ (again as described under the “Update $\boldsymbol{\beta }$” subheading).
B. Parameters of undirected exchangeable network covariance matrices
In this section, we give a $3\times 3$ matrix equation to invert $\boldsymbol{\Omega }$ rapidly. This equation also gives a basis to compute the partial derivatives $\left \{ \frac{\partial \phi _i}{ \partial p_j} \right \}$ , which we require for the EMM algorithm.
We define an undirected exchangeable network covariance matrix as a square, positive definite matrix of the form
We find empirically that the inverse of any undirected exchangeable network covariance matrix has the same form, that is, $\boldsymbol{\Omega }^{-1} = \sum _{i=1}^3 p_i{\mathcal{S}}_i$. Using this fact and the particular forms of the binary matrices $\{\mathcal{S}_i \}_{i=1}^3$, one can see that there are only three possible row-column inner products in the matrix multiplication $\boldsymbol{\Omega } \boldsymbol{\Omega }^{-1}$: those pertaining to row-column pairs of the form $(ij,ij)$, $(ij,ik)$, and $(ij,kl)$ for distinct indices $i,j,k,$ and $l$. Examining the three products in terms of the parameters in $\boldsymbol{\phi }$ and $\textbf{p}$, and using the fact that $\boldsymbol{\Omega } \boldsymbol{\Omega }^{-1} = \textbf{I}$, we get the following matrix equation for the parameters $\textbf{p}$ given $\boldsymbol{\phi }$
where the matrix $\textbf{C}(\boldsymbol{\phi })$ is given by
Then, we may invert $\boldsymbol{\Omega }$ with a $3 \times 3$ inverse to find the parameters $\textbf{p}$ of $\boldsymbol{\Omega }^{-1}$ . Explicitly solving these linear equations, the expressions for $\textbf{p}$ are given by
Taking only the largest terms in $n$ , one may approximate the values in $\textbf{p}$ as follows, which will be useful in following theoretical development:
The equation (B1) allows one to compute the partial derivatives $\left \{ \frac{\partial \phi _i}{ \partial p_j} \right \}$. First, based on (B1) and the symmetric roles of $\boldsymbol{\phi }$ and $\textbf{p}$, we can write $\textbf{C}(\textbf{p}) \boldsymbol{\phi } = [1,0,0]^T$. Then, we note that the matrix function $\textbf{C}(\cdot )$ in (B1) is linear in its argument, and thus, we may write $\textbf{C}({\textbf{p}}) = \sum _{j = 1}^3 p_j \textbf{A}_j^{(n)}$ for some matrices $\left \{ \textbf{A}_j^{(n)} \right \}_{j=1}^3$ that depend on $n$. Differentiating both sides of $\textbf{C}(\textbf{p}) \boldsymbol{\phi } = [1,0,0]^T$ with respect to $p_j$ and solving gives
which holds for all $j \in \{1,2,3 \}$ .
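The claim that the inverse retains the exchangeable form, and hence can be recovered from a $3 \times 3$ system, is easy to check numerically on a small network; the sketch below builds the binary matrices $\mathcal{S}_i$ from the pattern of shared actors between relations.

# Numerical check that Omega = phi1*S1 + phi2*S2 + phi3*S3 has an inverse of the
# same form, sum_i p_i * S_i, for a small network (n = 8, so choose(8, 2) = 28 relations).
n <- 8
pairs <- t(combn(n, 2))
d <- nrow(pairs)
shared <- matrix(0L, d, d)
for (a in 1:d) for (b in 1:d)
  shared[a, b] <- length(intersect(pairs[a, ], pairs[b, ]))
S1 <- diag(d)               # same relation
S2 <- (shared == 1) * 1     # relations sharing exactly one actor
S3 <- (shared == 0) * 1     # relations sharing no actors

phi   <- c(1, 0.2, 0.05)    # an example positive definite exchangeable covariance
Omega <- phi[1] * S1 + phi[2] * S2 + phi[3] * S3
Oinv  <- solve(Omega)

p1 <- Oinv[1, 1]
p2 <- Oinv[which(S2 == 1)[1]]
p3 <- Oinv[which(S3 == 1)[1]]
max(abs(Oinv - (p1 * S1 + p2 * S2 + p3 * S3)))   # numerically zero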
C. Theoretical support
In this section, we outline proofs suggesting that the estimators resulting from the EMM algorithm are consistent.
C.1 Consistency of $\widehat{\beta }_{\textrm{EMM}}$
The estimator of $\beta$ resulting from the EMM algorithm, $\widehat{\beta }_{\textrm{EMM}}$, depends on the estimated value of $\rho$, ${\widehat{\rho }}_{\textrm{EMM}}$, through the covariance matrix $\boldsymbol{\Omega }$. Explicitly, given $\boldsymbol{\Omega }$, the EMM estimator is
where $\widehat{E[\textbf{z} \mid \textbf{y}]}$ represents the estimation and approximation of $E[\textbf{z} \mid \textbf{y}]$ described in the EMM algorithm. This estimator is difficult to analyze in general, because, in principle, $\widehat{E[z_{jk} \mid \textbf{y}]}$ depends on every entry in $\textbf{y}$ , and the effects of the approximations are difficult to evaluate. Instead of direct analysis, to evaluate consistency of $\widehat{\beta }_{\textrm{EMM}}$ , we define a bounding estimator that is easier to analyze,
It is immediately clear that $\widehat{\beta }_\textrm{bound}$ is unbiased, since $E[u_{jk}] ={\textbf{x}}_{jk}^T{\boldsymbol{\beta }}$. Further, the approximations made in the EMM algorithm are meant to bound $\left|\!\left| \widehat{\beta }_{\textrm{EMM}} -{\beta }^*_{\textrm{MLE}} \right|\right|_2^2 \le \!\left|\!\left| \widehat{\beta }_\textrm{bound} -{\beta }^*_{\textrm{MLE}} \right|\right|_2^2$, where ${\beta }^*_{\textrm{MLE}}$ is the true maximum likelihood estimator. That is, the expectation estimator we compute, $\widehat{E[\textbf{z} \mid \textbf{y}]}$, takes into account correlation information through $\Omega$ and is thus closer to the true expectation, $E[\textbf{z} \mid \textbf{y}]$, than $\textbf{u}$, and so $\widehat{\beta }_{\textrm{EMM}}$ is closer to ${\beta }^*_{\textrm{MLE}}$ than $\widehat{\beta }_\textrm{bound}$ is. Hence, consistency of $\widehat{\beta }_\textrm{bound}$ implies consistency of $\widehat{\beta }_{\textrm{EMM}}$, since we assume that the true MLE is consistent.
We now establish consistency of $\widehat{\beta }_\textrm{bound}$ . We make the following assumptions:
1. The true model follows a latent variable model,
(C3) \begin{align} &{\mathbb{P}}(y_{ij} = 1 ) ={\mathbb{P}} \!\left ({\textbf{x}}_{ij}^T{\boldsymbol{\beta }} + \epsilon _{ij} \gt 0 \right ), \quad \\ &E[\epsilon _{jk}] = 0. \nonumber \end{align}
where $\boldsymbol{\epsilon }$ is not necessarily normally distributed.
2. The design matrix $\textbf{X}$ is such that the expressions $n^{-(1+i)}\textbf{X}^T \mathcal{S}_i \textbf{X}$, for $i \in \{1,2,3 \}$, converge in probability to constant matrices.
3. The fourth moments of $\textbf{X}$ and $\boldsymbol{\epsilon }$ are bounded, $||{\textbf{x}}_{jk} ||_4 \le C_1 \lt \infty$ and $E\!\left[\epsilon _{jk}^4\right] \le C_2 \lt \infty$.
4. The estimator of $\rho$ is such that $\Omega ({\widehat{\rho }})$ converges in probability to some positive definite matrix.
5. The independence assumption for relations that do not share an actor holds, such that $\epsilon _{jk}$ is independent of $\epsilon _{lm}$ whenever actors $j$, $k$, $l$, and $m$ are distinct.
The first assumption defines the meaning of the true coefficient $\beta$. The second assumption is a standard condition required for most regression problems; a similar condition is required for consistency of any estimator that accounts for correlation in a generalized linear model. We evaluate the fourth assumption in the following section, when we analyze ${\widehat{\rho }}_{\textrm{EMM}}$. The fifth assumption defines the minimal independence structure.
We start by noticing that $\textbf{u} = \textbf{X}\beta + \epsilon$ , such that
Then, as noted in the previous paragraph, the bounding estimator is unbiased, $E\!\left[\widehat{\beta }_\textrm{bound}\right] = \beta$ . It remains to establish sufficient conditions for which $\widehat{\beta }_\textrm{bound}$ converges to its expectation in probability. Noting the orders of $\{ p_i \}_i$ in (B3), we immediately have that $n^{-2} \textbf{X}^T \Omega ^{-1} \textbf{X}$ converges in probability to a constant. A sufficient condition to establish that $\left (n^{-2} \sum _{i=1}^3 p_i \textbf{X}^T \mathcal{S}_i{\textbf{v}} \right )$ converges in probability to its expectation (zero) is that its variance tends to zero. Expanding this variance expression,
By assumption, every term in the sum expression in (C5) is bounded. Also by assumption, the expectation $E[v_{lm} v_{tu}]$ is zero whenever the relations $lm$ and $tu$ do not share an actor. Using the expressions in (B3) ( $p_i \propto n^2 |\Theta _i|^{-1}$ ) and counting terms,
Thus, the variance of $\widehat{\beta }_\textrm{bound}$ converges to zero, so that $\widehat{\beta }_\textrm{bound}$ converges in probability to the true $\boldsymbol{\beta }$ , as does $\widehat{\beta }_{\textrm{EMM}}$ .
C.2 Consistency of ${\widehat{\rho }}_{\textrm{EMM}}$
Using the expressions in (B3) and differentiating the expected log-likelihood with respect to $\rho$ , the maximum likelihood estimator is
In the EMM algorithm, we approximate the expectations in (C6) using pairwise conditioning. Then, we have that
According to the exchangeability assumption of the errors, the pairwise expectations are known, and the EMM estimator of $\rho$ is unbiased, $E[{\widehat{\rho }}_{\textrm{EMM}}] = E[\epsilon _{jk} \epsilon _{lm}] = \rho$ . The EMM estimator ${\widehat{\rho }}_{\textrm{EMM}}$ converges to its expectation when the sums of conditional expectations in (C7) converge to their expectations. This occurs when the variances of these sums tend to zero. This fact can be established by similar counting arguments as in the previous subsection. For example,
since $E[\epsilon _{jk} \epsilon _{lm} \mid y_{jk}, y_{lm} ]$ is independent of $E[\epsilon _{rs} \epsilon _{tu} \mid y_{rs}, y_{tu} ]$ whenever all the indices $\{j,k,l,m,r,s,t,u\}$ are distinct. Thus, each of the sums of expectations in (C7) has variance that tends to zero, so that they converge to their marginal expectations, and ${\widehat{\rho }}_{\textrm{EMM}}$ is consistent.
C.3 Consistency under misspecification
In the discussion of consistency of the EMM estimator, we did not require the assumption of latent normality, nor of exchangeability of the latent errors (we do require a small assumption that the sequence of constants $n^{-3} E[\boldsymbol{\epsilon }^T \mathcal{S}_2 \boldsymbol{\epsilon }]$ converges to some constant on $[0,1/2)$). Hence, when the data-generating mechanism is non-Gaussian and non-exchangeable, we expect ${\widehat{\rho }}_{\textrm{EMM}}$ to converge to the pseudo-true $\rho$. The pseudo-true $\rho$ is the value which minimizes the Kullback-Leibler divergence from the modeled (Gaussian, exchangeable) distribution to the true distribution (Huber, Reference Huber1967; Dhaene, Reference Dhaene1997). In the discussion of consistency of $\widehat{\beta }_{\textrm{EMM}}$, we only require that ${\widehat{\rho }}_{\textrm{EMM}}$ converges to a fixed value on the interval $[0, 1/2)$, such that $\Omega (\rho )$ is positive definite. Again, when the data-generating mechanism is non-Gaussian and non-exchangeable, we expect $\widehat{\beta }_{\textrm{EMM}}$ to converge to the pseudo-true $\beta$. When the true data-generating mechanism is Gaussian (but not necessarily exchangeable), the limiting pseudo-true value for $\widehat{\beta }_{\textrm{EMM}}$ should be the true value.
D. Simulation studies
In this section, we present details pertaining to the second simulation study in Section 7.
D.1 Evaluation of estimation of $\beta$
See Section 7.2 for a description of the simulation study to evaluate performance in estimating $\boldsymbol{\beta }$; we provide further details in the rest of this paragraph. We generated $\{ x_{1i} \}_{i=1}^n$ as iid Bernoulli $(1/2)$ random variables, such that the second covariate is an indicator that $x_{1i} = x_{1j} = 1$. Each of $\{ x_{2i} \}_{i=1}^n$ and $\{ x_{3ij} \}_{ij}$ was generated as iid standard normal random variables. We fixed ${\boldsymbol{\beta }} = [\beta _0, \beta _1, \beta _2, \beta _3]^T = [-1, 1/2, 1/2, 1/2]^T$ throughout the simulation study. When generating from the latent eigenmodel in (5), we set $\Lambda = \textbf{I}$, $\sigma ^2_a = 1/6$, $\sigma ^2_u = 1/\sqrt{6}$, and $\sigma ^2_{\xi } = 1/3$.
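A hedged sketch of this data-generating step is below; the exact dyadic form of the covariate built from $\{x_{2i}\}$ is not restated here, so the nodal sum used in the third column is a placeholder only.

# Sketch of the covariate construction for the beta-estimation simulation.
set.seed(3)
n <- 50
pairs <- t(combn(n, 2))
x1 <- rbinom(n, 1, 1/2)              # binary nodal covariate
x2 <- rnorm(n)                        # continuous nodal covariate
X <- cbind(
  1,                                  # intercept
  x1[pairs[, 1]] * x1[pairs[, 2]],    # indicator that x_1i = x_1j = 1
  x2[pairs[, 1]] + x2[pairs[, 2]],    # placeholder dyadic form of the nodal covariate
  rnorm(nrow(pairs))                  # iid dyadic covariate x_3ij
)
beta <- c(-1, 1/2, 1/2, 1/2)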
To further investigate the source of the poor performance of the amen estimators of the social relations and latent eigenmodels, we computed the bias and the variance of the estimators when generating from the PX model and the latent eigenmodel in Figures D1 and D2, respectively. Figures D1 and D2 show that the variances of the amen estimators of the social relations and latent eigenmodels are similar to that of the PX model estimator; however, the bias of the amen estimators is substantially larger.
Both the EMM estimator of the PX model and the amen estimator of the social relations model provide estimates of $\rho$. We computed the RMSE for each estimator, for each $\textbf{X}$ realization, when generating from the PX model. In Figure D3, the RMSE plot for $\widehat{\rho }$ shows that the RMSE, and the spread of the RMSE, decrease with $n$ for the EMM estimator, suggesting that the EMM estimator of $\rho$ is consistent. As with the $\boldsymbol{\beta }$ parameters, the amen estimator displays substantially larger RMSE than the EMM estimator of $\rho$.
D.2 Remaining coefficients in $t$ simulation
We simulated from the PX model, modified to have a heavier-tailed $t_5$ error distribution. The scaled RMSE when estimating all entries in ${\boldsymbol{\beta }}$ is given in Figure D4. All coefficient estimators, for both PX: EMM and standard probit regression, appear consistent, but PX: EMM has lower RMSE.
E. Analysis of political books network
In this section, we present additional predictive results and verify the efficacy of an approximation made by the EMM algorithm when analyzing the political books network data set.
E.1 Prediction performance using ROC AUC
In Section 8, we use area under the precision-recall curve to evaluate predictive performance on the political books network data set. Figure E1 shows the results of the cross-validation study, described in Section 8, as measured by area under the receiver operating characteristic curve (ROC AUC). The conclusions are the same as those given in Section 8: the PX model appears to account for the inherent correlation in the data with estimation runtimes that are orders of magnitude faster than those of existing approaches.
E.2 Linear approximation in $\rho$ in EMM algorithm
In Section 5.2, we discuss a series of approximations to the E-step of an EM algorithm to maximize $\ell _{\textbf{y}}$ with respect to $\rho$ . One approximation is a linearization of the sample average $\frac{1}{|\Theta _2|} \sum _{jk, lm \in \Theta _2} E[\epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$ with respect to $\rho$ . In Figure E2, we confirm that this approximation is reasonable for the political books network data set. Figure E2 shows that the linear approximation to $\frac{1}{|\Theta _2|} \sum _{jk, lm \in \Theta _2} E[\epsilon _{jk} \epsilon _{lm} \, | \, y_{jk}, y_{lm} ]$ (dashed blue line), as described in detail in Section A.3, agrees well with the true average of the pairwise expectations (solid orange line).