
Computation for Latent Variable Model Estimation: A Unified Stochastic Proximal Framework

Published online by Cambridge University Press:  01 January 2025

Siliang Zhang
Affiliation:
East China Normal University
Yunxiao Chen*
Affiliation:
London School of Economics and Political Science
*
Correspondence should be made to Yunxiao Chen, Department of Statistics, London School of Economics and Political Science, London, England. Email: [email protected]

Abstract

Latent variable models have been playing a central role in psychometrics and related fields. In many modern applications, the inference based on latent variable models involves one or several of the following features: (1) the presence of many latent variables, (2) observed and latent variables that are continuous, discrete, or a combination of both, (3) constraints on parameters, and (4) penalties on parameters to impose model parsimony. The estimation often involves maximizing an objective function based on a marginal likelihood/pseudo-likelihood, possibly with constraints and/or penalties on parameters. Solving this optimization problem is highly non-trivial, due to the complexities brought by the features mentioned above. Although several efficient algorithms have been proposed, a unified computational framework that takes all these features into account is still lacking. In this paper, we fill this gap. Specifically, we provide a unified formulation for the optimization problem and then propose a quasi-Newton stochastic proximal algorithm. Theoretical properties of the proposed algorithm are established. Its computational efficiency and robustness are shown by simulation studies under various settings for latent variable model estimation.

Type
Theory and Methods
Creative Commons
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Copyright
Copyright © 2022 The Author(s)

Latent variable models have been playing a central role in psychometrics and related fields. Commonly used latent variable models include item response theory models (Embretson & Reise, 2000; Reckase, 2009), latent class models (Clogg, 1995; Rupp et al., 2010; von Davier & Lee, 2019), structural equation models (Bollen, 1989), error-in-variable models (Carroll et al., 2006), random-effects models (Hsiao, 2014), and models for missing data (Little & Rubin, 1987), where latent variables have different interpretations, such as hypothetical constructs, ‘true’ variables measured with error, unobserved heterogeneity, and missing data. We refer the readers to Rabe-Hesketh & Skrondal (2004) and Bartholomew et al. (2011) for a comprehensive review of latent variable models.

A latent variable model contains unobserved latent variables and unknown parameters. For example, an item response theory model contains individual-specific latent traits as latent variables and item-specific parameters as model parameters. Compared with models without latent variables, such as linear regression and generalized linear regression, the estimation of latent variable models is typically more involved. This estimation problem can be viewed from three perspectives: (1) fixed latent variables and parameters, (2) random latent variables and fixed parameters, and (3) random latent variables and parameters.

The first perspective, i.e., fixed latent variables and parameters, leads to the joint maximum likelihood (JML) estimator. This estimator can often be efficiently computed, for example, by an alternating minimization algorithm (Birnbaum, 1968; Chen et al., 2019, 2020). Unfortunately, the JML estimator is typically statistically inconsistent (Neyman & Scott, 1948; Andersen, 1973; Haberman, 1977; Ghosh, 1995), except under certain high-dimensional asymptotic regimes that are suitable for large-scale applications (Chen et al., 2019, 2020; Haberman, 1977, 2004). Treating both latent variables and parameters as random variables, the third perspective leads to a full Bayesian estimator, for which many Markov chain Monte Carlo (MCMC) algorithms have been developed (e.g., Béguin & Glas, 2001; Bolt & Lall, 2003; Dunson, 2000, 2003; Edwards, 2010).

The second perspective, i.e., random latent variables and fixed parameters, essentially follows an empirical Bayes (EB) approach (Robbins, 1956; C.-H. Zhang, 2003). This perspective is the most commonly adopted one (Rabe-Hesketh & Skrondal, 2004). Throughout the paper, we refer to estimators derived under this perspective as EB estimators. Both the full-information marginal maximum likelihood (MML) estimator (Bock & Aitkin, 1981) and the limited-information composite maximum likelihood (CML) estimator (Jöreskog & Moustaki, 2001; Vasdekis et al., 2012) can be viewed as special cases. Such estimators involve optimizing an objective function with respect to the fixed parameters, while the objective function is often intractable due to an integral with respect to the latent variables. The most commonly used algorithm for this optimization problem is the expectation-maximization (EM) algorithm (Dempster et al., 1977; Bock & Aitkin, 1981). This algorithm typically requires iteratively evaluating numerical integrals with respect to the latent variables, which is often computationally unaffordable when the dimension of the latent space is high.

A high-dimensional latent space is not the only challenge to the computation of EB estimators. Penalties and constraints on parameters may also be involved in the optimization, further complicating the computation. In fact, penalized estimators have become increasingly popular in latent variable analysis for learning sparse structures, with applications to restricted latent class analysis, exploratory item factor analysis, variable selection in structural equation models, and differential item functioning analysis, among others (Chen et al., 2015; Sun et al., 2016; Chen et al., 2018; Lindstrøm & Dahl, 2020; Tutz & Schauberger, 2015; Jacobucci et al., 2016; Magis et al., 2015). The penalty function is often non-smooth (e.g., the Lasso penalty; Tibshirani, 1996), for which many standard optimization tools (e.g., gradient descent methods) are not applicable. In addition, complex inequality constraints are also commonly encountered in latent variable estimation, for example, in structural equation models (Van De Schoot et al., 2010) and restricted latent class models (e.g., de la Torre, 2011; Xu, 2017). Such complex constraints further complicate the optimization.

In this paper, we propose a quasi-Newton stochastic proximal algorithm that simultaneously tackles the computational challenges mentioned above. This algorithm can be viewed as an extension of the stochastic approximation (SA) method (Robbins & Monro, 1951). Compared with SA, the proposed method converges faster and is more robust, thanks to the use of Polyak–Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988). The proposed method can also be viewed as a stochastic version of a proximal gradient descent algorithm (Parikh & Boyd, 2014, Chapter 4), in which constraints and penalties are handled by a proximal update. As will be illustrated by examples later, the proximal update is easy to evaluate for many commonly used penalties and constraints, making the proposed algorithm computationally efficient. Theoretical properties of the proposed method are established, showing that it is nearly optimal in its convergence speed.
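To fix ideas, the following is a minimal, generic sketch (in R, the language of the accompanying package) of a stochastic proximal gradient iteration with Polyak–Ruppert averaging. It is not the algorithm proposed in this paper; the step-size schedule, function names, and arguments are our own illustrative choices.

# Generic sketch of a stochastic proximal gradient iteration with
# Polyak-Ruppert averaging (illustrative only, not the proposed algorithm).
# grad_fn(beta) returns a noisy (stochastic) estimate of the gradient of -l(beta);
# prox_fn(beta, step) is the proximal operator handling R(beta) and/or constraints.
stochastic_prox_gradient <- function(beta0, grad_fn, prox_fn, n_iter = 1000, step0 = 1) {
  beta <- beta0
  beta_bar <- beta0
  for (t in seq_len(n_iter)) {
    step <- step0 / sqrt(t)                         # an illustrative decaying step size
    beta <- prox_fn(beta - step * grad_fn(beta), step)
    beta_bar <- beta_bar + (beta - beta_bar) / t    # running Polyak-Ruppert average of the iterates
  }
  beta_bar
}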

The proposed method is closely related to the stochastic-EM algorithm (Celeux, 1985; Ip, 2002; Nielsen, 2000; S. Zhang et al., 2020b) and the MCMC stochastic approximation algorithms (Cai, 2010a, 2010b; Gu & Kong, 1998), two popular methods for latent variable model estimation. Although these methods perform well in many problems, they are not as powerful as the proposed one. Specifically, the MCMC stochastic approximation algorithms cannot handle complex inequality constraints or non-smooth penalties, because they rely on stochastic gradients, which do not always exist in the presence of such constraints or penalties. In addition, as will be discussed later, both the stochastic-EM algorithm and the MCMC stochastic approximation algorithms are computationally less efficient than the proposed method, even for estimation problems without complex constraints or penalties.

The proposed method is also closely related to a perturbed proximal gradient algorithm proposed in Atchadé et al. (2017). The current development improves upon that of Atchadé et al. (2017) in two aspects. First, the proposed method is a quasi-Newton method, in which second-order information (i.e., second derivatives) of the objective function is used in the update. Although this step may only change the asymptotic convergence speed by a constant factor (when the number of iterations grows to infinity), our simulation study suggests that the new method converges much faster than that of Atchadé et al. (2017) empirically. Second, the theoretical analysis of Atchadé et al. (2017) only considers a convex optimization setting, while we consider a non-convex setting, which is typically the case for latent variable model estimation. Note that the analysis is much more involved when the objective function is non-convex. Therefore, our proof of sequence convergence is different from that of Atchadé et al. (2017). Specifically, the convergence theory is established by analyzing the convergence of a set-valued generalization of an ordinary differential equation (ODE).

The rest of the paper is organized as follows. In Sect. 1, we formulate latent variable model estimation as a general optimization problem that covers many commonly used estimators as special cases. In Sect. 2, a quasi-Newton stochastic proximal algorithm is proposed. Theoretical properties of the proposed algorithm are established in Sect. 3, suggesting that the proposed algorithm achieves the optimal convergence rate. The performance of the proposed algorithm is demonstrated and compared with other estimators by simulation studies in Sect. 4. We conclude with some discussions in Sect. 5. An R package has been developed and can be found at https://github.com/slzhang-fd/lvmcomp2.

1. Estimation of Latent Variable Models

1.1. Problem Setup

We consider the estimation of a parametric latent variable model. We adopt a general setting, followed by concrete examples in Sects. 1.2 and 1.3. Let $\mathbf{Y}$ be a random object representing observable data and let $\mathbf{y}$ be its realization. For example, in item factor analysis (IFA), $\mathbf{Y}$ represents (categorical) responses to all the items from all the respondents. A latent variable model specifies the distribution of $\mathbf{Y}$ by introducing a set of latent variables $\varvec{\xi} \in \Xi$, where $\Xi$ denotes the state space of the latent vector $\varvec{\xi}$. For example, in item factor analysis, $\varvec{\xi}$ consists of the latent traits of all the respondents and $\Xi$ is a Euclidean space. Let $\varvec{\beta} = (\beta_1, \ldots, \beta_p)^\top \in \mathcal{B}$ be a set of parameters in the model, where $\mathcal{B}$ denotes the parameter space.
The goal is to estimate $\varvec{\beta}$ given observed data $\mathbf{y}$.

We consider an EB estimator based on a marginal log-likelihood/pseudo-likelihood of the form

(1) $$l(\varvec{\beta}) = \log\left(\int_{\Xi} f(\mathbf{y}, \varvec{\xi} \mid \varvec{\beta})\, d\varvec{\xi}\right),$$

where $f(\mathbf{y}, \varvec{\xi} \mid \varvec{\beta})$ is a complete-data likelihood/pseudo-likelihood function that has an analytic form. We assume that the objective function $l(\varvec{\beta})$ is finite for any $\varvec{\beta} \in \mathcal{B}$ and is also smooth in $\varvec{\beta}$.

The estimator is given by solving the following optimization problem

(2) $$\hat{\varvec{\beta}} = \mathop{\arg\min}_{\varvec{\beta} \in \mathcal{B}} \; -l(\varvec{\beta}) + R(\varvec{\beta}),$$

where $R(\varvec{\beta})$ is a penalty function that has an analytic form, such as the Lasso, ridge, or elastic net regularization function. Note that the penalty function often depends on tuning parameters. Throughout this paper, we assume these tuning parameters are fixed and thus do not explicitly indicate them in the objective function (2). In practice, tuning parameters are often unknown and need to be chosen by cross-validation or an information criterion. We point out that many commonly used estimators take the form of (2), including the MML estimator, the CML estimator, and regularized estimators based on the MML and CML. We also point out that, despite its general applicability to latent variable estimation problems, the proposed method is most useful for complex problems that cannot be easily solved by the classical EM algorithm. For certain problems, such as the estimation of linear factor models and simple latent class models, both the E- and M-steps of the EM algorithm have closed-form solutions. In that situation, the classical EM algorithm may be computationally more efficient, though the proposed method can still be used.
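Although not needed for stating the estimator, it may help to recall a standard identity, sometimes referred to as Fisher's identity, that underlies the sampling-based algorithms discussed in the introduction: under regularity conditions that permit interchanging differentiation and integration,

$$\nabla l(\varvec{\beta}) = \frac{\int_{\Xi} \nabla_{\varvec{\beta}} f(\mathbf{y}, \varvec{\xi} \mid \varvec{\beta})\, d\varvec{\xi}}{\int_{\Xi} f(\mathbf{y}, \varvec{\xi} \mid \varvec{\beta})\, d\varvec{\xi}} = E\left[\nabla_{\varvec{\beta}} \log f(\mathbf{y}, \varvec{\xi} \mid \varvec{\beta}) \mid \mathbf{y}, \varvec{\beta}\right],$$

where the expectation is with respect to the posterior distribution of $\varvec{\xi}$ given $\mathbf{y}$ under parameter value $\varvec{\beta}$. Thus, an unbiased stochastic gradient of $l(\varvec{\beta})$ can be obtained from posterior draws of $\varvec{\xi}$, even though the integral in (1) is itself intractable.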

1.2. High-dimensional Item Factor Analysis

Item factor analysis models are commonly used in social and behavioral sciences for analyzing categorical response data. For exposition, we focus on binary response data and point out that the extension to ordinal response data is straightforward. Consider N individuals responding to J binary-scored items. Let $Y_{ij} \in \{0, 1\}$ be a random variable denoting person i's response to item j and let $y_{ij}$ be its realization. Thus, we have $\mathbf{Y} = (Y_{ij})_{N\times J}$ and $\mathbf{y} = (y_{ij})_{N\times J}$, where $\mathbf{Y}$ and $\mathbf{y}$ are the generic notations introduced in Sect. 1.1 for our data. A comprehensive review of IFA models and their estimation can be found in Chen & Zhang (2020a).

It is assumed that the dependence among an individual's responses is driven by a set of latent factors, denoted by $\varvec{\xi} = (\xi_{ik})_{N\times K}$, where $\xi_{ik}$ represents person i's kth factor. Recall that $\varvec{\xi}$ is our generic notation for the latent variables in Sect. 1.1, and here the state space is $\Xi = \mathbb{R}^{N\times K}$. Throughout this paper, we assume the number of factors K is known.

An IFA model makes the following assumptions:

  1. $\varvec{\xi}_i = (\xi_{i1}, \ldots, \xi_{iK})^\top$, $i = 1, \ldots, N$, are independent and identically distributed (i.i.d.) random vectors, following a multivariate normal distribution $N(\mathbf{0}, \varvec{\Sigma})$. The diagonal terms of $\varvec{\Sigma} = (\sigma_{kk'})_{K\times K}$ are set to one for model identification. As $\varvec{\Sigma}$ is a positive semi-definite matrix, it is common to reparametrize $\varvec{\Sigma}$ by the Cholesky decomposition,

     $$\varvec{\Sigma} = \mathbf{B}\mathbf{B}^\top,$$

     where $\mathbf{B} = (b_{kk'})_{K\times K}$ is a lower triangular matrix. Let $\mathbf{b}_k$ be the kth row of $\mathbf{B}$. Then $\Vert \mathbf{b}_k \Vert = 1$, $k = 1, \ldots, K$, since the diagonal terms of $\varvec{\Sigma}$ are constrained to value 1.

  2. $Y_{ij}$ given $\varvec{\xi}_i$ follows a Bernoulli distribution satisfying

     (3) $$\mathbb{P}(Y_{ij} = 1 \mid \varvec{\xi}_i) = \frac{\exp(d_j + \mathbf{a}_j^\top \varvec{\xi}_i)}{1 + \exp(d_j + \mathbf{a}_j^\top \varvec{\xi}_i)},$$

     where $d_j$ and $\mathbf{a}_j = (a_{j1}, \ldots, a_{jK})^\top$ are item-specific parameters. The parameters $a_{jk}$ are often known as the loading parameters.

  3. $Y_{i1}, \ldots, Y_{iJ}$ are assumed to be conditionally independent given $\varvec{\xi}_i$, which is known as the local independence assumption.
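To make these assumptions concrete, the following sketch (in R) simulates responses from the model above. It is illustrative only: the function name and arguments are ours, and MASS::mvrnorm is assumed to be available for multivariate normal sampling.

# Simulate binary responses from the IFA model in assumptions 1-3.
# N persons, J items, K factors; d: length-J intercepts; A: J x K loading matrix;
# B: K x K lower-triangular matrix with unit-norm rows.
simulate_ifa <- function(N, J, K, d, A, B) {
  Sigma <- B %*% t(B)                                      # unit-diagonal covariance matrix
  xi <- MASS::mvrnorm(N, mu = rep(0, K), Sigma = Sigma)    # latent traits (assumption 1)
  eta <- xi %*% t(A) + matrix(d, N, J, byrow = TRUE)       # d_j + a_j' xi_i
  prob <- plogis(eta)                                      # logistic response function (3)
  matrix(rbinom(N * J, size = 1, prob = prob), N, J)       # local independence (assumption 3)
}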

Note that we consider the most commonly used logistic model in (3). It is worth pointing out that the proposed algorithm also applies to the normal ogive (i.e., probit) model, which assumes that $\mathbb{P}(Y_{ij} = 1 \mid \varvec{\xi}_i) = \Phi(d_j + \mathbf{a}_j^\top \varvec{\xi}_i)$. Under the current setting and using the reparametrization for $\varvec{\Sigma}$, our model parameters are $\varvec{\beta} = \{\mathbf{B}, d_j, \mathbf{a}_j, j = 1, \ldots, J\}$. The marginal log-likelihood function takes the form

(4) $$l(\varvec{\beta}) = \sum_{i=1}^N \log\left(\int_{\mathbf{x} \in \mathbb{R}^{K}} \prod_{j=1}^J \frac{\exp[y_{ij}(d_j + \mathbf{a}_j^\top \mathbf{x})]}{1 + \exp(d_j + \mathbf{a}_j^\top \mathbf{x})}\, \phi(\mathbf{x} \mid \mathbf{B})\, d\mathbf{x}\right),$$

where $\phi(\mathbf{x} \mid \mathbf{B})$ is the density function of the multivariate normal distribution $N(\mathbf{0}, \mathbf{B}\mathbf{B}^\top)$. The K-dimensional integrals involved in (4) cause a high computational burden for a relatively large K (e.g., $K \ge 5$).
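As a rough illustration of why (4) is costly, the sketch below approximates it by naive Monte Carlo, drawing latent traits from the prior $N(\mathbf{0}, \mathbf{B}\mathbf{B}^\top)$. This is not the proposed algorithm (and is not numerically stable for large J); the function name is ours and MASS::mvrnorm is again assumed.

# Naive Monte Carlo approximation of the marginal log-likelihood (4).
# y: N x J binary matrix; d: length-J intercepts; A: J x K loadings;
# B: K x K lower-triangular matrix; M: number of Monte Carlo draws from the prior.
marginal_loglik_mc <- function(y, d, A, B, M = 1000) {
  K <- ncol(A)
  x <- MASS::mvrnorm(M, mu = rep(0, K), Sigma = B %*% t(B))   # M draws of the latent vector
  eta <- x %*% t(A) + matrix(d, M, ncol(y), byrow = TRUE)     # M x J linear predictors d_j + a_j' x
  loglik <- 0
  for (i in seq_len(nrow(y))) {
    lp <- eta %*% y[i, ] - rowSums(log1p(exp(eta)))           # log-likelihood of person i's responses per draw
    loglik <- loglik + log(mean(exp(lp)))                     # Monte Carlo estimate of the inner integral
  }
  loglik
}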

IFA models are commonly used for both exploratory and confirmatory analyses. In exploratory IFA, an important problem is to learn a sparse loading matrix $(a_{jk})_{J\times K}$ from data, which facilitates the interpretation of the factors. One approach is the $L_1$-regularized estimator (Sun et al., 2016), which takes the form

(5) $$\hat{\varvec{\beta}} = \mathop{\arg\min}_{\varvec{\beta} \in \mathcal{B}} \; -l(\varvec{\beta}) + R(\varvec{\beta}),$$

where the parameter space

$$\mathcal{B} = \left\{\varvec{\beta} \in \mathbb{R}^{p}: b_{kk'} = 0, \; 1\le k < k' \le K, \; \sum_{k'=1}^K b_{kk'}^2 = 1, \; k = 1, \ldots, K\right\},$$

and the penalty term

(6) $$R(\varvec{\beta}) = \lambda \sum_{j=1}^J \sum_{k=1}^K |a_{jk}|.$$

In $R(\varvec{\beta})$, $\lambda > 0$ is a tuning parameter assumed to be fixed throughout this paper. This regularized estimator resolves the rotational indeterminacy issue in exploratory IFA, as the $L_1$ penalty term is not rotationally invariant. Consequently, under mild regularity conditions, the loading matrix can be consistently estimated only up to a column swapping. Note that only the $\mathbf{B}$ matrix has constraints, as reflected by the parameter space $\mathcal{B}$. Here $b_{kk'} = 0$ holds because $\mathbf{B}$ is a lower triangular matrix, and $\sum_{k'=1}^K b_{kk'}^2 = 1$ holds because the diagonal terms of $\varvec{\Sigma} = \mathbf{B}\mathbf{B}^\top$ are all 1.
We remark that it is possible to replace the $L_1$ penalty in $R(\varvec{\beta})$ by other penalty functions for imposing sparsity, such as the elastic net penalty (Zou & Hastie, 2005)

(7) $$R(\varvec{\beta}) = \lambda_1 \sum_{j=1}^J \sum_{k=1}^K a_{jk}^2 + \lambda_2 \sum_{j=1}^J \sum_{k=1}^K |a_{jk}|,$$

where $\lambda_1, \lambda_2 > 0$ are two tuning parameters.
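As noted in the introduction, penalties such as (6) and (7) are easy to handle with proximal updates because their proximal operators have simple closed forms. The sketch below gives these well-known operators in R; it is a generic illustration rather than an excerpt of the proposed algorithm, and the function names are ours.

# Proximal operator of the L1 penalty lambda * sum |a_jk| with step size t:
# element-wise soft-thresholding of the loading matrix A.
prox_l1 <- function(A, lambda, t) {
  sign(A) * pmax(abs(A) - t * lambda, 0)
}

# Proximal operator of the elastic net penalty (7):
# soft-thresholding followed by a multiplicative shrinkage (standard closed form).
prox_enet <- function(A, lambda1, lambda2, t) {
  prox_l1(A, lambda2, t) / (1 + 2 * t * lambda1)
}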

In confirmatory IFA, zero constraints are imposed on loading parameters, based on prior knowledge about the measurement design. More precisely, these zero constraints can be coded by a binary matrix $\mathbf{Q} = (q_{jk})_{J\times K}$. If $q_{jk} = 0$, then item j does not load on factor k and $a_{jk}$ is set to 0. Otherwise, $a_{jk}$ is freely estimated. These constraints lead to the parameter space $\mathcal{B} = \{\varvec{\beta}: b_{kk'} = 0, 1\le k < k' \le K; \; \sum_{k'=1}^K b_{kk'}^2 = 1; \; a_{jk} = 0 \text{ for } q_{jk} = 0, \; j = 1, \ldots, J, \; k = 1, \ldots, K\}$. The MML estimator for confirmatory IFA is then given by

(8) $$\hat{\varvec{\beta}} = \mathop{\arg\min}_{\varvec{\beta} \in \mathcal{B}} \; -l(\varvec{\beta}).$$
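The constraint set $\mathcal{B}$ above only involves simple equality constraints, so its Euclidean projection (the proximal operator of its indicator function) has a closed form: zero out the loadings with $q_{jk} = 0$, keep $\mathbf{B}$ lower triangular, and rescale each of its rows to unit norm. The sketch below illustrates this projection; it is a generic illustration with names of our own choosing, not a statement of how the proposed algorithm is implemented.

# Project (A, B) onto the confirmatory IFA parameter space described above.
# A: J x K loading matrix; Q: J x K binary design matrix; B: K x K matrix.
project_confirmatory <- function(A, B, Q) {
  A[Q == 0] <- 0                             # zero constraints from the Q-matrix
  B[upper.tri(B)] <- 0                       # keep B lower triangular
  B <- B / sqrt(rowSums(B^2))                # unit-norm rows, so diag(B B') = 1
  list(A = A, B = B)
}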

Besides parameter estimation, another problem of interest in confirmatory IFA is statistical inference, which requires computing the asymptotic variance of $\hat{\varvec{\beta}}$. The estimation of the asymptotic variance often requires computing the Hessian matrix of $l(\varvec{\beta})$ at $\hat{\varvec{\beta}}$, which also involves intractable K-dimensional integrals. As we will see in Sect. 2.1, this Hessian matrix, as well as quantities taking a similar form, can be easily obtained as a by-product of the proposed algorithm.

1.3. Restricted Latent Class Model

Our second example is restricted latent class models, which are also widely used in social and behavioral sciences. For example, they are commonly used in education for cognitive diagnosis (von Davier & Lee, 2019). These models differ from IFA models in that they assume discrete latent variables. Here, we consider a setting for cognitive diagnosis when both data and latent variables are binary. Consider data taking the same form as that for IFA, denoted by $\mathbf{Y} = (Y_{ij})_{N\times J}$ and $\mathbf{y} = (y_{ij})_{N\times J}$. In this context, $Y_{ij} = 1$ means that item j is answered correctly and $Y_{ij} = 0$ means an incorrect answer.

The restricted latent class model assumes that each individual is characterized by a K-dimensional latent vector $\varvec{\xi}_i = (\xi_{i1}, \ldots, \xi_{iK})^\top$, $i = 1, \ldots, N$, where $\xi_{ik} \in \{0, 1\}$. Thus, the latent variables are $\varvec{\xi} = (\xi_{ik})_{N\times K}$, whose state space $\Xi = \{0,1\}^{N\times K}$ contains all $N\times K$ binary matrices. Each dimension of $\varvec{\xi}_i$ represents a skill, and $\xi_{ik} = 1$ indicates that person i has mastered the kth skill and $\xi_{ik} = 0$ otherwise.

The restricted latent class model can be parametrized as follows.

  1. The person-specific latent vectors $\varvec{\xi}_i$, $i = 1, \ldots, N$, are i.i.d., following a categorical distribution satisfying

     $$\mathbb{P}(\varvec{\xi}_i = \varvec{\alpha}) = \frac{\exp(\nu_{\varvec{\alpha}})}{\sum_{\varvec{\alpha}' \in \{0, 1\}^K} \exp(\nu_{\varvec{\alpha}'})},$$

     where $\varvec{\alpha} \in \{0, 1\}^K$ is an attribute profile representing the mastery status on all K attributes, and we set $\nu_{\varvec{\alpha}'} = 0$ as the baseline for $\varvec{\alpha}' = (0, \ldots, 0)^\top$.

  2. $Y_{ij}$ given $\varvec{\xi}_i$ follows a Bernoulli distribution, satisfying

     $$\mathbb{P}(Y_{ij} = 1 \mid \varvec{\xi}_i = \varvec{\alpha}) = \frac{\exp(\theta_{j, \varvec{\alpha}})}{1 + \exp(\theta_{j, \varvec{\alpha}})}, \quad \varvec{\alpha} \in \{0, 1\}^K.$$

  3. Local independence is still assumed. That is, $Y_{i1}, \ldots, Y_{iJ}$ are conditionally independent given $\varvec{\xi}_i$.
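As with the IFA example, a brief simulation sketch may help fix notation. It follows assumptions 1–3 above; all function and argument names are ours.

# Simulate binary responses from the restricted latent class model.
# nu: length-2^K vector of nu_alpha (0 for the all-zero baseline profile);
# theta: J x 2^K matrix of theta_{j,alpha}; columns are ordered as in `profiles`.
simulate_rlcm <- function(N, J, K, nu, theta) {
  profiles <- as.matrix(expand.grid(rep(list(0:1), K)))      # all 2^K attribute profiles
  p_class <- exp(nu) / sum(exp(nu))                          # P(xi_i = alpha), assumption 1
  class_id <- sample(seq_len(2^K), N, replace = TRUE, prob = p_class)
  prob <- plogis(t(theta[, class_id, drop = FALSE]))         # N x J success probabilities, assumption 2
  Y <- matrix(rbinom(N * J, size = 1, prob = prob), N, J)    # local independence, assumption 3
  list(Y = Y, xi = profiles[class_id, , drop = FALSE])
}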

The above model specification leads to the marginal log-likelihood function

(9) $$l(\varvec{\beta}) = \sum_{i=1}^N \log\left(\sum_{\varvec{\alpha} \in \{0,1\}^K} \frac{\exp(\nu_{\varvec{\alpha}})}{\sum_{\varvec{\alpha}' \in \{0, 1\}^K} \exp(\nu_{\varvec{\alpha}'})} \prod_{j=1}^J \frac{\exp(y_{ij}\theta_{j, \varvec{\alpha}})}{1 + \exp(\theta_{j, \varvec{\alpha}})}\right),$$

where $\varvec{\beta} = \{\nu_{\varvec{\alpha}}, \theta_{j, \varvec{\alpha}}, \varvec{\alpha} \in \{0, 1\}^K, j = 1, \ldots, J\}$.
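Because the latent variables are discrete here, the integral in (1) reduces to a finite sum, and (9) can be evaluated exactly by enumerating the $2^K$ profiles when K is small, as in the sketch below (illustrative only; the function and argument names are ours).

# Exact evaluation of the marginal log-likelihood (9) by enumerating all profiles.
# y: N x J binary matrix; nu: length-2^K vector; theta: J x 2^K matrix of theta_{j,alpha}.
rlcm_loglik <- function(y, nu, theta) {
  p_class <- exp(nu) / sum(exp(nu))                           # P(xi_i = alpha)
  # log P(y_i | alpha) = sum_j [ y_ij * theta_{j,alpha} - log(1 + exp(theta_{j,alpha})) ]
  log_cond <- y %*% theta -
    matrix(colSums(log1p(exp(theta))), nrow(y), length(nu), byrow = TRUE)
  sum(log(exp(log_cond) %*% p_class))                         # sum_i log sum_alpha P(alpha) P(y_i | alpha)
}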

We consider a confirmatory setting where there exists a design matrix, similar to the $\mathbf{Q}$-matrix in confirmatory IFA. With slight abuse of notation, we still denote it by $\mathbf{Q} = (q_{jk})_{J\times K}$, where $q_{jk} \in \{0, 1\}$. Here, $q_{jk} = 1$ indicates that solving item j requires the kth skill and $q_{jk} = 0$ otherwise. As will be explained below, this design matrix leads to equality and inequality constraints on the model parameters.

Denote q j = ( q j 1 , , q jK ) \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$${\mathbf {q}}_{j} = (q_{j1},\ldots , q_{jK})^\top $$\end{document} as the design vector for item j. For α = ( α 1 , , α K ) \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$\varvec{\alpha } = (\alpha _1,\ldots , \alpha _K)^\top $$\end{document} , we write

$$\boldsymbol{\alpha} \succeq \mathbf{q}_j, \quad \text{if } \alpha_k \ge q_{jk} \text{ for all } k \in \{1, \ldots, K\},$$

and write

$$\boldsymbol{\alpha} \nsucceq \mathbf{q}_j, \quad \text{if there exists } k \text{ such that } \alpha_k < q_{jk}.$$

That is, $\boldsymbol{\alpha} \succeq \mathbf{q}_j$ if profile $\boldsymbol{\alpha}$ has all the skills needed for solving item $j$, and $\boldsymbol{\alpha} \nsucceq \mathbf{q}_j$ if not. The design information leads to the following constraints:

1. $\mathbb{P}(Y_{ij} = 1 \mid \boldsymbol{\xi}_i = \boldsymbol{\alpha}) = \mathbb{P}(Y_{ij} = 1 \mid \boldsymbol{\xi}_i = \boldsymbol{\alpha}')$, if both $\boldsymbol{\alpha}, \boldsymbol{\alpha}' \succeq \mathbf{q}_j$. That is, individuals who have mastered all the required skills have the same chance of answering the item correctly.

2. $\mathbb{P}(Y_{ij} = 1 \mid \boldsymbol{\xi}_i = \boldsymbol{\alpha}) \ge \mathbb{P}(Y_{ij} = 1 \mid \boldsymbol{\xi}_i = \boldsymbol{\alpha}')$ if $\boldsymbol{\alpha} \succeq \mathbf{q}_j$ and $\boldsymbol{\alpha}' \nsucceq \mathbf{q}_j$. That is, students who have mastered all the required skills have a higher chance of answering the item correctly than those who do not.

3. $\mathbb{P}(Y_{ij} = 1 \mid \boldsymbol{\xi}_i = \boldsymbol{\alpha}) \ge \mathbb{P}(Y_{ij} = 1 \mid \boldsymbol{\xi}_i = \mathbf{0})$ for all $\boldsymbol{\alpha}$. That is, students who have not mastered any skill have the lowest chance of answering correctly.

We refer the readers to Xu (2017) for more discussions on these constraints, which are key to the identification of this model. Under these constraints, the MML estimator is given by

(10) $$\hat{\boldsymbol{\beta}} = \mathop{\arg\min}_{\boldsymbol{\beta} \in \mathcal{B}} \, -l(\boldsymbol{\beta}),$$

where

$$\mathcal{B} = \Big\{\boldsymbol{\beta}: \max_{\boldsymbol{\alpha} \succeq \mathbf{q}_j} \theta_{j,\boldsymbol{\alpha}} = \min_{\boldsymbol{\alpha} \succeq \mathbf{q}_j} \theta_{j,\boldsymbol{\alpha}} \ge \theta_{j,\boldsymbol{\alpha}'} \ge \theta_{j,\mathbf{0}}, \text{ if } \boldsymbol{\alpha}' \nsucceq \mathbf{q}_j, \;\; \nu_{\mathbf{0}} = 0 \Big\}.$$

When $K$ is relatively large, the computation for solving (10) becomes challenging, due to both the summation over the $2^K$ possible values of $\boldsymbol{\alpha}$ in $l(\boldsymbol{\beta})$ and the large number of inequality constraints.
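To make this cost concrete, the following is a minimal Python sketch (not part of the proposed algorithm) that evaluates the marginal log-likelihood (9) by brute-force enumeration of all $2^K$ attribute profiles. The dictionaries `nu` and `theta`, holding $\nu_{\boldsymbol{\alpha}}$ and $\theta_{j,\boldsymbol{\alpha}}$, are illustrative data structures we introduce here; the point is that the cost grows as $O(N J 2^K)$, which quickly becomes prohibitive as $K$ increases.

```python
import itertools
import numpy as np

def marginal_loglik(y, nu, theta):
    """Brute-force evaluation of the marginal log-likelihood (9).

    y     : (N, J) binary response matrix
    nu    : dict mapping each profile alpha (a K-tuple of 0/1) to nu_alpha
    theta : dict mapping (j, alpha) to theta_{j, alpha}
    """
    N, J = y.shape
    K = len(next(iter(nu)))                                  # number of attributes
    profiles = list(itertools.product((0, 1), repeat=K))     # all 2^K profiles
    log_prior = np.array([nu[a] for a in profiles])
    log_prior = log_prior - np.log(np.sum(np.exp(log_prior)))  # log exp(nu)/sum exp(nu')
    ll = 0.0
    for i in range(N):
        log_terms = np.empty(len(profiles))
        for a_idx, a in enumerate(profiles):
            lt = log_prior[a_idx]
            for j in range(J):
                t_ja = theta[(j, a)]
                lt += y[i, j] * t_ja - np.log1p(np.exp(t_ja))  # Bernoulli log-mass
            log_terms[a_idx] = lt
        ll += np.log(np.sum(np.exp(log_terms)))                # sum over 2^K profiles
    return ll
```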

2. Stochastic Proximal Algorithm

In this section, we propose a quasi-Newton stochastic proximal algorithm for the computation of (2). The description in this section will focus on the computation aspect, without emphasizing the regularity conditions needed for its convergence. A rigorous theoretical treatment will be given in Sect. 3. In what follows, we describe the algorithm in its general form in Sect. 2.1, followed by details for two specific models in Sects. 2.2 and 2.3, and finally comparisons with related algorithms in Sect. 2.4.

2.1. General Algorithm

For ease of exposition, we introduce some new notation. We write the penalty function as the sum of two terms, $R(\boldsymbol{\beta}) = R_1(\boldsymbol{\beta}) + R_2(\boldsymbol{\beta})$, where $R_1(\boldsymbol{\beta})$ is a smooth function and $R_2(\boldsymbol{\beta})$ is non-smooth. In the example of regularized estimation for exploratory IFA, $R_1(\boldsymbol{\beta}) = 0$ and $R_2(\boldsymbol{\beta}) = \lambda \sum_{j=1}^J \sum_{k=1}^K |a_{jk}|$ when $R(\boldsymbol{\beta})$ is an $L_1$ penalty as in (6). When an elastic net penalty is used as in (7), $R_1(\boldsymbol{\beta}) = \lambda_1 \sum_{j=1}^J \sum_{k=1}^K a_{jk}^2$ and $R_2(\boldsymbol{\beta}) = \lambda_2 \sum_{j=1}^J \sum_{k=1}^K |a_{jk}|$.

The optimization problem can be reexpressed as

(11) $$\min_{\boldsymbol{\beta}} \; h(\boldsymbol{\beta}) + g(\boldsymbol{\beta}),$$

where $h(\boldsymbol{\beta}) = -l(\boldsymbol{\beta}) + R_1(\boldsymbol{\beta})$ and $g: \mathbb{R}^p \rightarrow \mathbb{R} \cup \{+\infty\}$ is a generalized function taking the form $g(\boldsymbol{\beta}) = R_2(\boldsymbol{\beta}) + I_{\mathcal{B}}(\boldsymbol{\beta})$, where

(12) $$I_{\mathcal{B}}(\boldsymbol{\beta}) = \begin{cases} 0, & \text{if } \boldsymbol{\beta} \in \mathcal{B},\\ +\infty, & \text{otherwise.} \end{cases}$$

Note that since both $l(\boldsymbol{\beta})$ and $R_1(\boldsymbol{\beta})$ are smooth in $\boldsymbol{\beta}$, $h(\boldsymbol{\beta})$ is still smooth in $\boldsymbol{\beta}$. The second term $g(\boldsymbol{\beta})$ is non-smooth in $\boldsymbol{\beta}$, unless it is degenerate (i.e., $g(\boldsymbol{\beta}) \equiv 0$). We further write

(13) $$H(\boldsymbol{\xi}, \boldsymbol{\beta}) = -\log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta}) + R_1(\boldsymbol{\beta}),$$

which can be viewed as a complete-data version of $h(\boldsymbol{\beta})$ that will be used in the algorithm.

The algorithm relies on a scaled proximal operator (Lee et al., 2014) for the $g$ function, defined as

$$\text{Prox}_{\gamma, g}^{\mathbf{D}}(\boldsymbol{\beta}) = \mathop{\arg\min}_{\mathbf{x} \in \mathbb{R}^p} \left\{ g(\mathbf{x}) + \frac{1}{2\gamma} \|\mathbf{x} - \boldsymbol{\beta}\|_{\mathbf{D}}^2 \right\},$$

where $\gamma > 0$, $\mathbf{D}$ is a strictly positive definite matrix, and $\|\cdot\|_{\mathbf{D}}$ is the norm defined by $\|\mathbf{x}\|_{\mathbf{D}}^2 = \langle \mathbf{x}, \mathbf{x} \rangle_{\mathbf{D}} = \mathbf{x}^\top \mathbf{D} \mathbf{x}$. The choices of $\gamma$ and $\mathbf{D}$, and the intuition behind the proximal operator, will be explained in the sequel.

Our general algorithm is described in Algorithm 1, followed by implementation details. The proposed algorithm is an extension of a perturbed proximal gradient algorithm (Atchadé et al., 2017). The major difference is that the proposed algorithm makes use of second-order information from the smooth part of the objective function, which can substantially speed up its convergence. See Sect. 2.4 for further comparison.

Algorithm 1

(Stochastic Proximal Algorithm)

  • Input Data $\mathbf{y}$, initial parameters $\boldsymbol{\beta}^{(0)}$, a sequence of step sizes $\gamma_s$, $s = 1, 2, \ldots$, pre-specified tuning parameters $c_2 \ge c_1 > 0$, and burn-in size $\varpi$.

  • Update At the $t$th iteration, where $t \ge 1$, we perform the following two steps:

    1. Stochastic step Sample $\boldsymbol{\xi}$ from the conditional distribution of $\boldsymbol{\xi}$ given $\mathbf{y}$,

      $$\psi(\boldsymbol{\xi}) \propto f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta}^{(t-1)}),$$
      and obtain $\boldsymbol{\xi}^{(t)}$. The sampling can be either exact or approximated by MCMC.
    2. Proximal step Update the model parameters by

      (14) $$\boldsymbol{\beta}^{(t)} = \text{Prox}_{\gamma_t, g}^{\mathbf{D}^{(t)}}\big(\boldsymbol{\beta}^{(t-1)} - \gamma_t (\mathbf{D}^{(t)})^{-1} \mathbf{G}^{(t)}\big),$$
      where
      $$\mathbf{G}^{(t)} = \frac{\partial H(\boldsymbol{\xi}^{(t)}, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \bigg\vert_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t-1)}}.$$
      $\mathbf{D}^{(t)}$ is a diagonal matrix with diagonal entries
      $$\delta_i^{(t)} = \frac{t-1}{t}\, \delta_i^{(t-1)} + \frac{1}{t}\, T\big(\tilde{\delta}_i^{(t)}; c_1, c_2\big),$$
      where $T(x; c_1, c_2)$ is a truncation function defined as
      (15) $$T(x; c_1, c_2) = \begin{cases} c_1, & \text{if } x < c_1,\\ x, & \text{if } x \in [c_1, c_2],\\ c_2, & \text{if } x > c_2. \end{cases}$$
      Here $\tilde{\delta}_i^{(t)} = \tilde{\delta}_{i,1}^{(t)} + \big(\tilde{\delta}_{i,2}^{(t)}\big)^2$, where
      $$\begin{aligned} \tilde{\delta}_{i,1}^{(t)} &= (1 - \gamma_t)\,\tilde{\delta}_{i,1}^{(t-1)} + \gamma_t \left( \frac{\partial^2 H(\boldsymbol{\xi}^{(t)}, \boldsymbol{\beta})}{\partial \beta_i^2}\bigg\vert_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t-1)}} - \left( \frac{\partial H(\boldsymbol{\xi}^{(t)}, \boldsymbol{\beta})}{\partial \beta_i}\right)^{\!2} \bigg\vert_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t-1)}}\right),\\ \tilde{\delta}_{i,2}^{(t)} &= (1 - \gamma_t)\,\tilde{\delta}_{i,2}^{(t-1)} + \gamma_t\, \frac{\partial H(\boldsymbol{\xi}^{(t)}, \boldsymbol{\beta})}{\partial \beta_i}\bigg\vert_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t-1)}}. \end{aligned}$$
    Iteratively perform these two steps until a stopping criterion is satisfied and let n be the last iteration number.

  • Output $\bar{\boldsymbol{\beta}}_n = \sum_{t=\varpi+1}^{n} \boldsymbol{\beta}^{(t)} / (n - \varpi)$.
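To summarize the flow of Algorithm 1, the following is a minimal Python sketch of the main loop. The callables `sample_latent`, `grad_H`, `hess_diag_H`, and `prox_g` are model-specific ingredients that a user would supply and are assumptions of this sketch, not part of the algorithm's statement; for brevity, a fixed iteration budget replaces the window-based stopping rule of Remark 7, and the initial diagonal entries of $\mathbf{D}$ are an arbitrary choice that is overwritten at the first iteration.

```python
import numpy as np

def stochastic_proximal(y, beta0, sample_latent, grad_H, hess_diag_H, prox_g,
                        n_iter=2000, burn_in=1000, mu=1.0, eps=1e-2,
                        c1=1e-2, c2=1e2):
    """Minimal sketch of Algorithm 1.

    sample_latent(y, beta) -> one draw of xi with psi(xi) proportional to f(y, xi | beta)
    grad_H(xi, beta)       -> gradient of H(xi, beta) with respect to beta
    hess_diag_H(xi, beta)  -> diagonal second derivatives of H(xi, beta)
    prox_g(v, gamma, d)    -> scaled proximal operator Prox_{gamma, g}^{diag(d)}(v)
    """
    beta = np.array(beta0, dtype=float)
    p = beta.size
    delta = np.ones(p)                  # diagonal of D^(t); overwritten at t = 1
    d1 = np.zeros(p)                    # running estimate of delta_tilde_{i,1}
    d2 = np.zeros(p)                    # running estimate of delta_tilde_{i,2}
    beta_bar, kept = np.zeros(p), 0
    for t in range(1, n_iter + 1):
        gamma = mu * t ** (-0.5 - eps)              # step size (Remark 4)
        xi = sample_latent(y, beta)                 # stochastic step
        G = grad_H(xi, beta)                        # G^(t)
        d1 = (1 - gamma) * d1 + gamma * (hess_diag_H(xi, beta) - G ** 2)
        d2 = (1 - gamma) * d2 + gamma * G
        delta = ((t - 1) * delta + np.clip(d1 + d2 ** 2, c1, c2)) / t
        beta = prox_g(beta - gamma * G / delta, gamma, delta)   # proximal step (14)
        if t > burn_in:                             # Polyak-Ruppert averaging
            kept += 1
            beta_bar += (beta - beta_bar) / kept
    return beta_bar
```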

In what follows, we make a few remarks to provide some intuitions about the algorithm.

Remark 1

(Connection with stochastic gradient descent) To provide some intuition about the proposed method, we first make a connection between it and the stochastic gradient descent (SGD) algorithm. When the sampling of $\boldsymbol{\xi}$ is exact in the stochastic step, $\mathbf{G}^{(t)}$ is a stochastic gradient of the smooth part of our objective function, in the sense that $\mathbb{E}(\mathbf{G}^{(t)} \mid \mathbf{y}, \boldsymbol{\beta}^{(t-1)}) = \nabla h(\boldsymbol{\beta})\vert_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t-1)}}$. If, in addition, there is no constraint or non-smooth penalty, i.e., $g(\boldsymbol{\beta}) \equiv 0$, then the proximal step degenerates to an SGD update $\boldsymbol{\beta}^{(t)} = \boldsymbol{\beta}^{(t-1)} - \gamma_t (\mathbf{D}^{(t)})^{-1} \mathbf{G}^{(t)}$. In that case, the proposed method becomes a version of SGD.

Remark 2

(Proximal step) We provide some intuition about the proximal step. We start with two special cases. First, as mentioned in Remark 1, if there is no constraint or non-smooth penalty, then the proximal step is nothing but a stochastic gradient descent step. This is because the scaled proximal operator degenerates to an identity map, i.e., $\text{Prox}_{\gamma, g}^{\mathbf{D}}(\boldsymbol{\beta}) = \boldsymbol{\beta}$. Second, when the $g$ function involves constraints but does not contain a non-smooth penalty, the proximal step is a projected stochastic gradient descent step. That is, one first performs a stochastic gradient descent update $\tilde{\boldsymbol{\beta}}^{(t)} = \boldsymbol{\beta}^{(t-1)} - \gamma_t (\mathbf{D}^{(t)})^{-1} \mathbf{G}^{(t)}$. Then $\tilde{\boldsymbol{\beta}}^{(t)}$ is projected back to the feasible region $\mathcal{B}$ by the scaled proximal operator:

$$\hat{\boldsymbol{\beta}} = \mathop{\arg\min}_{\boldsymbol{\beta} \in \mathcal{B}} \|\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}}^{(t)}\|_{\mathbf{D}},$$

which is a projection under the norm $\|\cdot\|_{\mathbf{D}}$. When $\mathbf{D}$ is an identity matrix, as in the vanilla (i.e., non-scaled) proximal operator, the projection is based on the Euclidean distance.
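To illustrate how the scaling matrix affects this projection, the following sketch projects a point onto a half-space constraint under the norm $\|\cdot\|_{\mathbf{D}}$ with a diagonal $\mathbf{D}$. The half-space is a toy feasible region chosen purely for illustration; it is not the constraint set $\mathcal{B}$ defined above.

```python
import numpy as np

def project_halfspace_D(x, a, b, d):
    """Project x onto {z : a @ z <= b} under ||.||_D with D = diag(d).

    Closed form: if a @ x > b, the minimizer of ||z - x||_D subject to
    a @ z <= b is x - lambda * D^{-1} a with lambda = (a @ x - b) / (a @ D^{-1} a).
    """
    x, a, d = map(np.asarray, (x, a, d))
    slack = a @ x - b
    if slack <= 0:                         # already feasible
        return x.copy()
    return x - (slack / (a @ (a / d))) * (a / d)

# With D = diag(1, 10) the second coordinate is "expensive" to move,
# so the projection shifts mostly the first coordinate.
print(project_halfspace_D([2.0, 2.0], [1.0, 1.0], 1.0, [1.0, 10.0]))
```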

More generally, when the $g$ function involves non-smooth penalties, the proximal step can be viewed as minimizing the sum of $g(\boldsymbol{\beta})$ and a quadratic approximation of $h(\boldsymbol{\beta})$ at $\boldsymbol{\beta}^{(t-1)}$; see Lee et al. (2014) for more explanations. We provide an example to facilitate the understanding. Suppose that

$$g(\boldsymbol{\beta}) = \lambda \sum_{i=1}^p |\beta_i|$$

is the Lasso penalty, and $\mathbf{D} = \mathrm{diag}(\delta_1, \ldots, \delta_p)$ is a diagonal matrix, where $\lambda, \delta_i > 0$, $i = 1, \ldots, p$. Then $\text{Prox}_{\gamma, g}^{\mathbf{D}}(\tilde{\boldsymbol{\beta}}^{(t)})$ involves solving $p$ optimization problems separately, each of which takes the form

(16) $$\hat{\beta}_i = \mathop{\arg\min}_{x} \; \frac{1}{2}\big(x - \tilde{\beta}_i^{(t)}\big)^2 + \frac{\lambda \gamma}{\delta_i}\,|x|.$$

It is well known that (16) has a closed-form solution given by soft-thresholding (see Chapter 3, Friedman et al., 2001):

$$\hat{\beta}_i = \begin{cases} \tilde{\beta}_i^{(t)} - \frac{\lambda\gamma}{\delta_i}, & \text{if } \tilde{\beta}_i^{(t)} > \frac{\lambda\gamma}{\delta_i},\\ \tilde{\beta}_i^{(t)} + \frac{\lambda\gamma}{\delta_i}, & \text{if } \tilde{\beta}_i^{(t)} < -\frac{\lambda\gamma}{\delta_i},\\ 0, & \text{otherwise.} \end{cases}$$
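A direct vectorized implementation of this soft-thresholding rule might look as follows; the function and argument names are ours, with `d` standing for the diagonal entries $\delta_1, \ldots, \delta_p$ of $\mathbf{D}$.

```python
import numpy as np

def prox_l1_scaled(beta_tilde, gamma, lam, d):
    """Coordinate-wise soft-thresholding solving (16) for each i,
    with threshold lam * gamma / delta_i for coordinate i."""
    thr = lam * gamma / np.asarray(d)
    return np.sign(beta_tilde) * np.maximum(np.abs(beta_tilde) - thr, 0.0)

# Entries whose magnitude falls below the threshold are set exactly to zero.
print(prox_l1_scaled(np.array([0.5, -0.02, 1.2]), gamma=0.1, lam=1.0,
                     d=np.array([1.0, 1.0, 4.0])))
```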

Remark 3

(Role of $\mathbf{D}^{(t)}$) Our proximal step is a quasi-Newton proximal update proposed in Lee et al. (2014) under a non-stochastic optimization setting. As shown in Lee et al. (2014), quasi-Newton proximal methods converge faster than first-order proximal methods under the non-stochastic setting. Here, the diagonal matrix $\mathbf{D}^{(t)}$ is used to approximate the Hessian matrix of $h(\boldsymbol{\beta})$ at $\boldsymbol{\beta}^{(t)}$. When $\boldsymbol{\beta}^{(t)}$ converges to $\hat{\boldsymbol{\beta}}$, the $i$th diagonal entry $\delta_i^{(t)}$ of $\mathbf{D}^{(t)}$ converges to $T\big(\frac{\partial^2 h}{\partial \beta_i^2}\big\vert_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}}; c_1, c_2\big)$, where $T$ is the truncation function defined in (15); see Remark 8 for more explanations.

In the proposed update, we choose $\mathbf{D}^{(t)}$ to be a diagonal matrix for computational convenience. Specifically, as discussed in Remark 2, the proximal step has a closed form when $\mathbf{D}^{(t)}$ is a diagonal matrix. In addition, the proximal step requires calculating the inverse of $\mathbf{D}^{(t)}$, whose complexity is much lower when $\mathbf{D}^{(t)}$ is diagonal.

We point out that using a diagonal matrix to approximate the Hessian matrix is a popular and effective trick in numerical optimization (e.g., Chapter 5, Bertsekas et al., 1992; Becker & Le Cun, 1988), especially for large-scale optimization problems. In principle, it is possible to allow $\mathbf{D}^{(t)}$ to be non-diagonal. In fact, it is not difficult to generalize the BFGS updating formula for $\mathbf{D}^{(t)}$ given in Lee et al. (2014) to a stochastic version.

Our choice of $\mathbf{D}^{(t)}$ guarantees that its eigenvalues are constrained to the interval $[c_1, c_2]$. This rules out the singular situation in which $\mathbf{D}^{(t)}$ is not strictly positive definite. In the implementation, we set $c_1$ to be a sufficiently small constant and $c_2$ to be a sufficiently large constant. According to our simulations, the algorithm tends to be insensitive to these choices.

We further provide some remarks regarding the implementation details.

Remark 4

(Choices of step size) As will be shown in Sect. 3, the convergence of the proposed method requires the step sizes to satisfy $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$. This requirement is also needed in the Robbins–Monro algorithm. Here, we choose the step size $\gamma_t = \mu t^{-\frac{1}{2} - \varepsilon}$ so that the above requirement is satisfied, where $\mu$ is a positive constant and $\varepsilon$ is a small positive constant. As will be shown in Sect. 3, with sufficiently small $\varepsilon$, $\bar{\boldsymbol{\beta}}_n$ is almost optimal in terms of its convergence speed. We point out that $\varepsilon$ is needed to prove the convergence of $\bar{\boldsymbol{\beta}}_n$ under our non-convex setting. It is not needed if the objective function (2) is convex; see Atchadé et al. (2017). The requirement of $\varepsilon$ may be an artifact of our proof strategy. Simulation results show that the algorithm converges well even if we set $\varepsilon = 0$. For the numerical analysis in this paper, we set $\varepsilon = 10^{-2}$.
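For concreteness, a small sketch of this step-size schedule is given below; the default value $\mu = 1$ is an arbitrary illustrative choice, not a recommendation from the paper.

```python
def step_size(t, mu=1.0, eps=1e-2):
    """Step size gamma_t = mu * t^(-1/2 - eps) of Remark 4; with eps > 0,
    the sum of gamma_t diverges while the sum of gamma_t^2 converges."""
    return mu * t ** (-0.5 - eps)

# The schedule decays slowly relative to the O(1/t) Robbins-Monro rate.
print([round(step_size(t), 3) for t in (1, 10, 100, 1000)])
```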

We point out that our choice of step size is very different from the step size in the Robbins–Monro algorithm, for which asymptotic results (Fabian, 1968) suggest that the optimal choice of step size satisfies $\gamma_t = O(1/t)$.

Remark 5

(Starting point) As the objective function (2) is typically non-convex for latent variable models, the choice of the starting point $\boldsymbol{\beta}^{(0)}$ matters. The algorithm is more likely to converge to the global optimum given a good starting point. One strategy for avoiding poor local optima is to run the proposed algorithm with multiple random starting points and then choose the solution with the smallest objective function value. Alternatively, one may find a good starting point using less accurate but computationally faster estimators, such as the constrained joint maximum likelihood estimator (Chen et al., 2019, 2020) or spectral methods (H. Zhang et al., 2020a).

Remark 6

(Sampling in the stochastic step) As mentioned in Remark 1, when the latent variables $\boldsymbol{\xi}$ can be sampled exactly in the stochastic step, $\mathbf{G}^{(t)}$ is a stochastic gradient of $h(\boldsymbol{\beta})$. Unfortunately, exact sampling is only possible in some situations, such as restricted latent class analysis. In most cases, we only have approximate samples from an MCMC algorithm. For example, as discussed below, the latent variables in IFA can be sampled by a block-wise Gibbs sampler. With approximate samples, $\mathbf{G}^{(t)}$ is only approximately unbiased. As we show in Sect. 3, such a $\mathbf{G}^{(t)}$ may still yield convergence of $\bar{\boldsymbol{\beta}}_n$.

Remark 7

(Stopping criterion) In the implementation of Algorithm 1, we stop the iterative update by monitoring a window of successive differences in $\boldsymbol{\beta}^{(t)}$. More precisely, we stop the iterations if all the differences in the window are less than a given threshold. Unless otherwise stated, the numerical analysis in this paper uses a window of size 3. The same stopping criterion is also adopted by the Metropolis–Hastings Robbins–Monro algorithm proposed by Cai (2010a).
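A sketch of this window-based check is given below. The sup-norm used to measure successive differences and the tolerance value are our own illustrative choices, as the remark does not fix them.

```python
import numpy as np

def window_converged(beta_history, tol=1e-3, window=3):
    """Return True if the last `window` successive changes in beta are all
    below `tol`; beta_history is the list of iterates beta^(t) so far."""
    if len(beta_history) < window + 1:
        return False
    recent = np.asarray(beta_history[-(window + 1):])
    diffs = np.max(np.abs(np.diff(recent, axis=0)), axis=1)  # sup-norm of each change
    return bool(np.all(diffs < tol))
```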

Finally, as we explain in Remark 8, certain quantities, including the Hessian matrix of $l(\boldsymbol{\beta})$, can be obtained as a by-product of the proposed algorithm.

Remark 8

(By-product) It is often of interest to compute quantities of the form

(17) $$\hat{M} = \mathbb{E}\left[ m(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta}) \mid \mathbf{y}, \boldsymbol{\beta} \right] \big\vert_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}},$$

where $m(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})$ is a given function with an analytic form and the conditional expectation $\mathbb{E}[\,\cdot \mid \mathbf{y}, \boldsymbol{\beta}]$ is with respect to the conditional distribution of $\boldsymbol{\xi}$ given $\mathbf{y}$. The quantity (17) is intractable due to the high-dimensional integral with respect to $\boldsymbol{\xi}$. One such example is the Hessian matrix of $l(\boldsymbol{\beta})$ at $\hat{\boldsymbol{\beta}}$, discussed in Sect. 1.2, which is a key quantity for the statistical inference of $\hat{\boldsymbol{\beta}}$. In fact, by Louis' formula (Louis, 1982),

$$\begin{aligned} \frac{\partial^2 l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^\top} =\;& \mathbb{E}\left[ \frac{\partial^2 \log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^\top} + \frac{\partial \log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \left[\frac{\partial \log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right]^\top \,\Bigg\vert\, \mathbf{y}, \boldsymbol{\beta} \right]\\ & - \mathbb{E}\left[ \frac{\partial \log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \,\Bigg\vert\, \mathbf{y}, \boldsymbol{\beta} \right] \left( \mathbb{E}\left[ \frac{\partial \log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \,\Bigg\vert\, \mathbf{y}, \boldsymbol{\beta} \right] \right)^\top. \end{aligned}$$
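Given posterior draws of $\boldsymbol{\xi}$, the conditional expectations in Louis' formula can be approximated by Monte Carlo, as in the following sketch. The callables `grad_logf` and `hess_logf`, returning the gradient and Hessian of $\log f(\mathbf{y}, \boldsymbol{\xi} \mid \boldsymbol{\beta})$, are model-specific and are assumptions of this sketch.

```python
import numpy as np

def louis_hessian_mc(xi_draws, grad_logf, hess_logf, y, beta):
    """Monte Carlo approximation of Louis' formula for the Hessian of l(beta),
    averaging over posterior draws xi_draws of the latent variables."""
    grads = np.array([grad_logf(y, xi, beta) for xi in xi_draws])      # (T, p)
    hessians = np.array([hess_logf(y, xi, beta) for xi in xi_draws])   # (T, p, p)
    g_bar = grads.mean(axis=0)
    # E[second derivative | y] + E[score score^T | y] - E[score | y] E[score | y]^T
    return (hessians.mean(axis=0)
            + np.einsum('ti,tj->ij', grads, grads) / len(xi_draws)
            - np.outer(g_bar, g_bar))
```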

The computation of (17) is a straightforward by-product of the proposed algorithm. To approximate $\hat{M}$, we only need to add the following update in each iteration,

(18) $$\begin{aligned} M^{(t)} = M^{(t-1)} + \gamma_t\big(m(\mathbf{y},\varvec{\xi}^{(t)}\mid \varvec{\beta}^{(t)}) - M^{(t-1)}\big), \end{aligned}$$

for $t \ge 2$, where $M^{(1)} = m(\mathbf{y},\varvec{\xi}^{(1)}\mid \varvec{\beta}^{(1)})$. We approximate $\hat{M}$ by the Polyak–Ruppert average $\bar{M}_n = (\sum_{t=\varpi+1}^{n} M^{(t)})/(n-\varpi)$. When the sequence $\varvec{\beta}^{(t)}$ converges to $\hat{\varvec{\beta}}$ (see Theorem 1 for the convergence analysis), under mild conditions, Theorem 3.17 of Benveniste et al. (1990) suggests the convergence of $M^{(n)}$ to $\hat{M}$ with probability 1, which further implies the convergence of $\bar{M}_n$ to $\hat{M}$.
Note that we use the averaged estimator $\bar{M}_n$, as it tends to converge faster than the pre-averaging sequence $M^{(n)}$. We point out that the updating rule for the diagonal matrix $\mathbf{D}^{(t)}$ in Algorithm 1 makes use of such an averaged estimator.
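The recursion (18) and the trailing Polyak–Ruppert average are simple to carry along with the main iteration. The following Python sketch illustrates the bookkeeping; `sample_latent` and `m_func` are hypothetical model-specific helpers standing in for the sampling of $\varvec{\xi}^{(t)}$ and the evaluation of $m(\mathbf{y},\varvec{\xi}\mid\varvec{\beta})$, and in the actual algorithm the draw $\varvec{\xi}^{(t)}$ produced by the stochastic step would be reused rather than re-sampled as is done here for self-containedness.

```python
import numpy as np

def approximate_M(y, beta_path, sample_latent, m_func, burn_in, step=lambda t: t ** -0.75):
    """Stochastic approximation of M-hat via recursion (18) plus Polyak-Ruppert averaging.

    beta_path    : iterates beta^{(1)}, beta^{(2)}, ... produced by the main algorithm
    sample_latent: draws xi^{(t)} given (y, beta^{(t)})       [hypothetical helper]
    m_func       : evaluates m(y, xi | beta)                  [hypothetical helper]
    burn_in      : number of initial iterations excluded from the average (varpi)
    step         : step-size schedule gamma_t = t^{-1/2 - eps}, here with eps = 1/4
    """
    M = None
    running_sum, n_kept = 0.0, 0
    for t, beta in enumerate(beta_path, start=1):
        xi = sample_latent(y, beta)             # stochastic step (re-sampled here for illustration)
        m_val = np.asarray(m_func(y, xi, beta))
        if M is None:
            M = m_val                           # M^{(1)} = m(y, xi^{(1)} | beta^{(1)})
        else:
            M = M + step(t) * (m_val - M)       # recursion (18)
        if t > burn_in:
            running_sum = running_sum + M       # accumulate iterates after burn-in
            n_kept += 1
    return running_sum / n_kept                 # Polyak-Ruppert average bar-M_n
```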

Remark 9

(Burn-in size) Like MCMC algorithms, the proposed method has a burn-in period, and parameter updates from that period are not used in the Polyak–Ruppert averaging. The choice of the burn-in size does not affect the asymptotic properties of the method, but it does affect the empirical performance. This is because the parameter updates in early iterations may be far away from the solution due to the effect of the starting point, and including them in the Polyak–Ruppert averaging may introduce a large bias. In our numerical analysis, the burn-in size $\varpi$ is fixed at a sufficiently large value in each of our examples. Adaptive choices of the burn-in size are possible; see S. Zhang et al. (2020b).

2.2. Example I: Item Factor Analysis

We now explain the details of using the proposed method to solve (5) for exploratory IFA. The computation is similar when the $L_1$ regularization is replaced by the elastic net regularization. For confirmatory IFA, the stochastic step is the same as that of exploratory IFA, and the proximal update step is straightforward as no penalty is involved. Therefore, the details of the computation for confirmatory IFA are omitted here.

We first consider the stochastic step for solving (5). Note that $\varvec{\xi}_1, \ldots, \varvec{\xi}_N$ are conditionally independent given data, and thus can be sampled separately. For each $\varvec{\xi}_i$, we sample its entries by Gibbs sampling. More precisely, each entry is sampled by adaptive rejection sampling (Gilks & Wild, 1992; S. Zhang et al., 2020b), as the conditional distribution of $\xi_{ik}$ given data and the other entries of $\varvec{\xi}_i$ is log-concave. We refer the readers to S. Zhang et al. (2020b) for more explanations of this sampling procedure. If a normal ogive IFA model is considered instead of the logistic model above, then we can sample $\varvec{\xi}^{(t)}_i$ by a similar Gibbs method with a data augmentation trick; see Chen & Zhang (2020a) for a review.
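To make the stochastic step concrete, the Python sketch below performs one Gibbs sweep over the entries of $\varvec{\xi}_i$. The two-parameter logistic form of the linear predictor, the normal prior with precision matrix `Sigma_inv`, and the crude grid-based inverse-CDF draw are all illustrative assumptions; in the actual implementation, each log-concave univariate conditional would be sampled by adaptive rejection sampling rather than on a grid.

```python
import numpy as np

def gibbs_sweep_ifa(y_i, xi_i, d, A, Sigma_inv, grid=np.linspace(-6, 6, 401), rng=np.random):
    """One Gibbs sweep over xi_i for a logistic IFA model (illustrative sketch).

    y_i       : binary responses of person i (length J)
    xi_i      : current latent vector of person i (length K), modified in place
    d, A      : intercepts (length J) and loading matrix (J x K), both assumed given
    Sigma_inv : inverse of the latent covariance matrix (K x K)
    A grid-based inverse-CDF draw stands in for adaptive rejection sampling; the
    univariate target conditional is the same log-concave density in both cases.
    """
    K = xi_i.shape[0]
    for k in range(K):
        log_dens = np.empty_like(grid)
        for g, val in enumerate(grid):
            xi_try = xi_i.copy()
            xi_try[k] = val
            eta = d + A @ xi_try                                 # linear predictors
            loglik = np.sum(y_i * eta - np.log1p(np.exp(eta)))   # logistic log-likelihood
            logprior = -0.5 * xi_try @ Sigma_inv @ xi_try        # normal prior term
            log_dens[g] = loglik + logprior
        probs = np.exp(log_dens - log_dens.max())
        probs /= probs.sum()
        xi_i[k] = rng.choice(grid, p=probs)                      # approximate draw from the conditional
    return xi_i
```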

We now discuss the computation for the proximal step. Recall that $\varvec{\beta} = \{\mathbf{B}, d_j, \mathbf{a}_j, j = 1, \ldots, J\}$. We denote

$$\begin{aligned} \tilde{\varvec{\beta}}^{(t)} = \varvec{\beta}^{(t-1)} - \gamma_t (\mathbf{D}^{(t)})^{-1} \mathbf{G}^{(t)} \end{aligned}$$

as the input of the scaled proximal operator. The parameter update is given by

$$\begin{aligned} \varvec{\beta}^{(t)} = \text{Prox}_{\gamma, g}^{\mathbf{D}}(\tilde{\varvec{\beta}}^{(t)}) = \mathop{\arg\min}_{\varvec{\beta}} \left\{ g(\varvec{\beta}) + \frac{1}{2\gamma_t} \sum_{i=1}^{p} \delta_i^{(t)} (\beta_i - \tilde{\beta}_i)^2 \right\}, \end{aligned}$$

where the parameter space

$$\begin{aligned} \mathcal{B} = \Big\{\varvec{\beta} \in \mathbb{R}^{p}: b_{kk'} = 0,\ 1 \le k < k' \le K,\ \sum_{k'=1}^{K} b_{kk'}^2 = 1,\ k = 1, \ldots, K \Big\}, \end{aligned}$$

and $g(\varvec{\beta}) = \lambda \sum_{j=1}^{J}\sum_{k=1}^{K} \vert a_{jk}\vert + I_{\mathcal{B}}(\varvec{\beta})$ only involves the loading parameters $a_{jk}$ and the parameters $\mathbf{B}$ for the covariance matrix.

We first look at the update for the $d_j$'s. As the function $g$ does not involve $d_j$, its update is simply $d_j^{(t)} = \tilde{d}_j^{(t)}$, where $\tilde{d}_j^{(t)}$ is the corresponding component of $\tilde{\varvec{\beta}}^{(t)}$. We then look at the update for the loading parameters $a_{jk}$. Suppose that $a_{jk}$ corresponds to the $i_{a_{jk}}$th component of $\varvec{\beta}$. Then the update is given by solving the optimization problem

$$\begin{aligned} a_{jk}^{(t)} = \mathop{\arg\min}_{a_{jk}} \; \lambda \vert a_{jk}\vert + \frac{1}{2\gamma_t}\, \delta_{i_{a_{jk}}}^{(t)} (a_{jk} - \tilde{a}_{jk}^{(t)})^2. \end{aligned}$$

As discussed in Remark 2, this optimization has a closed-form solution via soft-thresholding. We finally look at the update for $\mathbf{B}$. Suppose that $b_{kl}$ corresponds to the $i_{b_{kl}}$th component of $\varvec{\beta}$. Then the update of $\mathbf{b}_k$, the $k$th row of $\mathbf{B}$, is given by solving the following optimization problem:

$$\begin{aligned} \mathbf{b}_k^{(t)} = \mathop{\arg\min}_{\mathbf{b}_k:\, \Vert \mathbf{b}_k\Vert = 1,\ b_{kk'} = 0,\ k' > k} \; \sum_{l=1}^{K} \delta_{i_{b_{kl}}}^{(t)} (b_{kl} - \tilde{b}_{kl}^{(t)})^2, \end{aligned}$$

which can be solved easily by the method of Lagrange multipliers.
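Putting the three pieces together, a minimal Python sketch of the proximal step for this example is given below. The soft-thresholding operator is standard; the row update for $\mathbf{B}$ assumes, for simplicity, a common scaling weight within each row, in which case the Lagrange-multiplier solution reduces to zeroing the constrained entries and normalizing (with unequal weights, a one-dimensional root-finding step for the multiplier would be needed instead).

```python
import numpy as np

def soft_threshold(x, thr):
    """Closed-form solution of min_a lam*|a| + delta/(2*gamma)*(a - x)^2, with thr = lam*gamma/delta."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def proximal_step_ifa(d_tilde, A_tilde, B_tilde, delta_A, gamma_t, lam):
    """One proximal update for the exploratory IFA example (illustrative sketch).

    d_tilde : gradient-updated intercepts (length J)   -> unchanged, since g does not involve d_j
    A_tilde : gradient-updated loadings (J x K)        -> entrywise soft-thresholding
    B_tilde : gradient-updated matrix B (K x K)        -> rows projected onto the constraint set
    delta_A : scaling weights from D^{(t)} for the loadings (J x K)
    gamma_t : step size; lam : L1 penalty weight lambda
    The row projection assumes equal scaling weights within each row of B.
    """
    d_new = d_tilde.copy()
    A_new = soft_threshold(A_tilde, lam * gamma_t / delta_A)
    B_new = B_tilde.copy()
    for k in range(B_new.shape[0]):
        B_new[k, k + 1:] = 0.0                          # enforce b_{kk'} = 0 for k' > k
        B_new[k, :] /= np.linalg.norm(B_new[k, :])      # enforce unit-norm row
    return d_new, A_new, B_new
```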

2.3. Example II: Restricted LCA

We now provide a brief discussion on the computation for the restricted LCA model. First, the stochastic step is straightforward, as the posterior distribution of each $\varvec{\xi}_i$ is still a categorical distribution, which can be sampled exactly. Second, the proximal step requires solving a quadratic programming problem. Again, we denote

$$\begin{aligned} \tilde{\varvec{\beta}}^{(t)} = \varvec{\beta}^{(t-1)} - \gamma_t (\mathbf{D}^{(t)})^{-1} \mathbf{G}^{(t)}. \end{aligned}$$

The proximal step requires solving the following quadratic programming problem:

(19) $$\begin{aligned} \begin{aligned} &\min_{\varvec{\beta}}\ (\varvec{\beta} - \tilde{\varvec{\beta}}^{(t)})^{\top} \mathbf{D}^{(t)} (\varvec{\beta} - \tilde{\varvec{\beta}}^{(t)}), \\ \text{s.t.}\ & \max_{\varvec{\alpha} \succeq \mathbf{q}_{j}} \theta_{j, \varvec{\alpha}} = \min_{\varvec{\alpha} \succeq \mathbf{q}_{j}} \theta_{j, \varvec{\alpha}} \ge \theta_{j, \varvec{\alpha}'} \ge \theta_{j, \mathbf{0}}, \quad \text{for all } \varvec{\alpha}' \nsucceq \mathbf{q}_j, \\ & \text{and } \nu_{\mathbf{0}} = 0. \end{aligned} \end{aligned}$$

Quadratic programming is among the most studied nonlinear convex optimization problems (Chapter 4, Boyd et al., 2004), and many efficient solvers exist. In our simulation study in Sect. 4.3, we use the dual method of Goldfarb & Idnani (1983) implemented in the R package quadprog (Turlach et al., 2019).
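As an illustration of this step, the Python sketch below solves a small quadratic program of the same form, $\min_{\varvec{\beta}} (\varvec{\beta} - \tilde{\varvec{\beta}})^{\top}\mathbf{D}(\varvec{\beta} - \tilde{\varvec{\beta}})$ subject to linear inequality constraints, using SciPy's general-purpose SLSQP solver as a stand-in for a dedicated QP routine; the toy constraint matrix only mimics the ordering constraints in (19) and does not reproduce the actual restricted LCA constraints.

```python
import numpy as np
from scipy.optimize import minimize

def prox_qp(beta_tilde, D_diag, A_ineq, b_ineq):
    """Solve min (beta - beta_tilde)' diag(D_diag) (beta - beta_tilde) s.t. A_ineq @ beta >= b_ineq.
    A dedicated QP solver (e.g., the dual method of Goldfarb & Idnani) is used in the paper;
    SLSQP is only an illustrative stand-in here."""
    obj = lambda b: np.sum(D_diag * (b - beta_tilde) ** 2)
    grad = lambda b: 2.0 * D_diag * (b - beta_tilde)
    cons = [{"type": "ineq", "fun": lambda b: A_ineq @ b - b_ineq}]
    res = minimize(obj, x0=beta_tilde, jac=grad, method="SLSQP", constraints=cons)
    return res.x

# Toy example: two parameters constrained so that beta[0] >= beta[1] >= 0.
beta_tilde = np.array([0.2, 0.7])
D_diag = np.array([1.0, 2.0])
A_ineq = np.array([[1.0, -1.0],   # beta[0] - beta[1] >= 0
                   [0.0,  1.0]])  # beta[1] >= 0
print(prox_qp(beta_tilde, D_diag, A_ineq, np.zeros(2)))
```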

2.4. Comparison with Related Algorithms

We compare Algorithm 1 with several related algorithms in more detail.

2.4.1. Robbins-Monro SA and Variants

The proposed method is closely related to the stochastic approximation approach first proposed in Robbins & Monro (1951), and its variants given in Gu & Kong (1998) and Cai (2010a) that are specially designed for latent variable model estimation. Note that the Robbins–Monro method is the first SGD method with a convergence guarantee. Both the methods of Gu & Kong (1998) and Cai (2010a) approximate the original Robbins–Monro method by using MCMC sampling to generate an approximate stochastic gradient in each iteration, when an unbiased stochastic gradient is difficult to obtain. None of these methods, however, handles complex constraints or non-smooth objective functions.

When there is no constraint or penalty on the parameters (i.e., $g(\varvec{\beta}) \equiv 0$), the proximal operator degenerates to an identity map. In this case, the proposed method is essentially the same as those of Gu & Kong (1998) and Cai (2010a), except for the sampling method in the stochastic step, the way the Hessian matrix is approximated, the specific choices of step size, and the averaging in the last step of the proposed method. Among these differences, the step size and the trajectory averaging are key to the advantage of the proposed method.

As pointed out in Remark 4, the Robbins–Monro procedure has the same general requirement on the step size as the proposed method. Specifically, the Robbins–Monro procedure, as well as its MCMC variants (Gu & Kong, 1998; Cai, 2010a), typically lets the step size $\gamma_t$ decay at the rate $1/t$, as suggested by asymptotic theory (Fabian, 1968). However, this step size is often too small at the early stage of the algorithm, resulting in poor performance in practice (Sect. 4.5.3, Spall, 2003). In contrast, the proposed method adopts a longer step size. By further adopting Polyak–Ruppert averaging (Ruppert, 1988; Polyak & Juditsky, 1992), we show in Sect. 3 that the proposed method almost achieves the optimal convergence speed.
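The practical difference between the two schedules can be seen even on a toy stochastic optimization problem unrelated to latent variable models. In the Python snippet below, a weakly curved quadratic is minimized with noisy gradients; after the same budget of iterations, the $1/t$ schedule typically leaves a visible bias, while the longer schedule $\gamma_t = t^{-0.6}$ combined with trajectory averaging lands much closer to the optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, theta_star, n_iter, beta0 = 0.1, 1.0, 5000, 0.0   # weak curvature mu makes 1/t steps too short

def run(step_fn, average=False, burn_in=500):
    beta, path = beta0, []
    for t in range(1, n_iter + 1):
        grad = mu * (beta - theta_star) + 0.1 * rng.standard_normal()  # noisy gradient of mu*(b-theta)^2/2
        beta -= step_fn(t) * grad
        path.append(beta)
    return np.mean(path[burn_in:]) if average else beta

est_rm  = run(lambda t: 1.0 / t)                   # Robbins-Monro 1/t schedule, last iterate
est_avg = run(lambda t: t ** -0.6, average=True)   # longer steps + trajectory averaging
print(abs(est_rm - theta_star), abs(est_avg - theta_star))
```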

2.4.2. Perturbed Proximal Gradient Algorithm

The proximal gradient descent algorithm (Parikh & Boyd, 2014) is a non-stochastic algorithm for solving non-smooth and/or constrained optimization problems. For example, the widely used gradient projection algorithm for oblique rotation in factor analysis (Jennrich, 2002) is a special case. The vanilla proximal gradient descent algorithm does not use the second-order information of the objective function and thus sometimes converges slowly. To improve the convergence speed, proximal Newton-type methods that utilize the second-order information of the smooth part of the objective function have been proposed in Lee et al. (2014).

The perturbed proximal gradient algorithm (Atchadé et al., 2017) solves a similar optimization problem as in (2) by combining the methods of stochastic approximation, proximal gradient descent, and Polyak–Ruppert averaging. The proposed method extends Atchadé et al. (2017) by adopting the Newton-type proximal update suggested in Lee et al. (2014). The method of Atchadé et al. (2017) can be viewed as a special case of the proposed one with $c_1 = c_2$, in which case the scaling matrix reduces to a multiple of the identity. As shown by the simulation studies in the sequel, thanks to the second-order information, the proposed method converges much faster than that of Atchadé et al. (2017). We also point out that the theoretical analysis of Atchadé et al. (2017) focuses on convex optimization, while in Sect. 3 we consider a more general setting of non-convex optimization that includes a wide range of latent variable model estimation problems as special cases.

2.4.3. Stochastic EM Algorithm

The proposed method is also closely related to the stochastic-EM algorithm (Celeux, 1985; Ip, 2002; Nielsen, 2000; S. Zhang et al., 2020b). The stochastic-EM algorithm is a similar iterative algorithm, consisting of a stochastic step and a maximization step in each iteration, where the stochastic step is the same as that in the proposed algorithm. The maximization step plays a similar role to the proximal step in the proposed algorithm. More precisely, when there is no constraint or penalty, the maximization step of the stochastic-EM algorithm obtains the parameter update $\varvec{\beta}^{(t)}$ by minimizing the negative complete-data log-likelihood function $-\log f(\mathbf{y}, \varvec{\xi}^{(t)}\mid \varvec{\beta})$, instead of taking a stochastic gradient update. It is also recommended to perform trajectory averaging in the stochastic-EM algorithm (Nielsen, 2000; S. Zhang et al., 2020b), like the last step of the proposed algorithm. As pointed out in S. Zhang et al. (2020b), the stochastic-EM algorithm can potentially handle constraints and non-smooth penalties on parameters by incorporating them into the maximization step.

The stochastic-EM algorithm is typically not as fast as the proposed method, as revealed by the simulation studies below. This is because it requires solving an optimization problem completely in each iteration, which is time-consuming, especially when constraints and non-smooth penalties are involved. On the other hand, the proximal step of the proposed algorithm can often be performed efficiently.
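The contrast between the two iterations can be summarized by the following Python sketch, in which `sample_latent`, `neg_complete_loglik`, `grad_neg_complete_loglik`, and `prox` are hypothetical model-specific ingredients; it is meant only to highlight that stochastic EM runs a full inner optimization per iteration, whereas the proposed algorithm takes a single scaled gradient step followed by an inexpensive proximal map.

```python
from scipy.optimize import minimize

def stem_iteration(y, beta, sample_latent, neg_complete_loglik):
    """Stochastic EM: full inner maximization at every iteration.
    Constraints/penalties, if any, would also have to enter this inner problem."""
    xi = sample_latent(y, beta)                                     # stochastic step
    res = minimize(lambda b: neg_complete_loglik(y, xi, b), beta)   # complete M-step
    return res.x

def stochastic_proximal_iteration(y, beta, t, sample_latent, grad_neg_complete_loglik,
                                  prox, D_inv, gamma):
    """Proposed algorithm: one scaled gradient step plus a cheap proximal map."""
    xi = sample_latent(y, beta)                                     # stochastic step
    beta_tilde = beta - gamma(t) * D_inv @ grad_neg_complete_loglik(y, xi, beta)
    return prox(beta_tilde, gamma(t))                               # proximal step
```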

3. Theoretical Properties

In what follows, we establish the asymptotic properties of the proposed algorithm under suitable technical conditions. Readers who are not interested in the asymptotic theory can skip this section without affecting the reading of the rest of the paper. Note that in this section we view the data as fixed, and the randomness comes from the sampling of the latent variables. The expectations below are taken with respect to the latent variables $\varvec{\xi}$ given the data $\varvec{y}$ and parameters $\varvec{\beta}$, denoted by $\mathbb{E}(\cdot \mid \varvec{\beta}) = \int \cdot\; \pi_{\varvec{\beta}}(\varvec{\xi})\, d\varvec{\xi}$, where $\pi_{\varvec{\beta}}$ is the posterior distribution of $\varvec{\xi}$ given $\varvec{y}$ and $\varvec{\beta}$. Let $\Vert \cdot \Vert$ denote the vector $l_2$-norm.
Following the typical convergence analysis of non-convex optimization (e.g., Chapter 3, Floudas, 1995), we first establish, in Theorem 1, the convergence of the sequence $\varvec{\beta}^{(t)}$ to a stationary point of the objective function $h(\varvec{\beta}) + g(\varvec{\beta})$, following the theoretical development in Duchi & Ruan (2018). Then, with some additional assumptions on the local geometry of the objective function at the stationary point being converged to, we establish in Theorem 2 the convergence rate of the Polyak–Ruppert averaged sequence $\bar{\varvec{\beta}}_n$, which extends the results of Atchadé et al. (2017) to the setting of non-convex optimization.

For a function $f: \mathbb{R}^{p} \mapsto \mathbb{R} \cup \{+\infty\}$, denote the Fréchet subdifferential (Chapter 8.B, Rockafellar & Wets, 1998) of $f$ at the point $\varvec{x}$ by $\partial f(\varvec{x})$,

$$\begin{aligned} \partial f(\varvec{x}) = \left\{ \varvec{z} \in \mathbb{R}^{p}: f(\varvec{y}) \ge f(\varvec{x}) + \varvec{z}^{\top}(\varvec{y} - \varvec{x}) + o(\Vert \varvec{y} - \varvec{x}\Vert) \text{ as } \varvec{y} \rightarrow \varvec{x} \right\}. \end{aligned}$$
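For intuition, a standard one-dimensional example (added here purely for illustration) is the absolute value function that arises from the $L_1$ penalty:
$$\begin{aligned} \partial \vert x\vert = \begin{cases} \{1\}, & x > 0,\\ [-1,\,1], & x = 0,\\ \{-1\}, & x < 0, \end{cases} \end{aligned}$$
so, for instance, a point $x$ is stationary for $s(x) + \lambda \vert x\vert$ with smooth $s$ if and only if $-s'(x) \in \lambda\, \partial \vert x\vert$.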

Define the set of stationary points of the objective function as

$$\begin{aligned} \mathcal{B}^{*} = \{\varvec{\beta} \in \mathcal{B}: \exists\, \mathbf{x} \in \partial h(\varvec{\beta}) + \partial g(\varvec{\beta}) \text{ with } \mathbf{x}^{\top}(\mathbf{y} - \varvec{\beta}) \ge 0 \text{ for all } \mathbf{y} \in \mathcal{B}\}. \end{aligned}$$

Note that the global minimum $\hat{\varvec{\beta}}$ is a stationary point, i.e., $\hat{\varvec{\beta}} \in \mathcal{B}^{*}$. In addition, when the objective function is smooth, i.e., $g(\varvec{\beta}) \equiv 0$, then $\mathcal{B}^{*} = \{\varvec{\beta} \in \mathcal{B}: \nabla h(\varvec{\beta}) = 0\}$, which is the standard definition of the set of stationary points of a smooth function.

The following conditions are imposed on the objective function.

  1. H1. $\mathcal{B}$ is compact and contains finitely many stationary points. For stationary points $\varvec{\beta}, \varvec{\beta}' \in \mathcal{B}^{*}$, $h(\varvec{\beta}) + g(\varvec{\beta}) = h(\varvec{\beta}') + g(\varvec{\beta}')$ if and only if $\varvec{\beta} = \varvec{\beta}'$.

  2. H2. $H(\varvec{\xi}, \varvec{\beta})$ is a differentiable function of $\varvec{\beta}$ for any given $\varvec{\xi}$, and let $\mathbf{G}_{\varvec{\beta}}(\varvec{\xi}) = \partial H(\varvec{\xi}, \varvec{\beta})/\partial \varvec{\beta}$. Define the function $M_{\epsilon}: \Theta \times \Xi \rightarrow \mathbb{R}_{+}$ as
     $$\begin{aligned} M_{\epsilon}(\varvec{\beta}; \varvec{\xi}) = \sup_{\varvec{\beta}' \in \mathcal{B},\, \Vert \varvec{\beta}' - \varvec{\beta}\Vert < \epsilon} \Vert \mathbf{G}_{\varvec{\beta}'}(\varvec{\xi})\Vert. \end{aligned}$$
     There exists $\epsilon_0 > 0$ such that for all $0 < \epsilon < \epsilon_0$,
     $$\begin{aligned} \mathbb{E}[M_{\epsilon}(\varvec{\beta}; \varvec{\xi})^2 \mid \varvec{\beta}] < \infty \quad \text{for all } \varvec{\beta} \in \mathcal{B}. \end{aligned}$$
  3. H3. There exists $\epsilon_0 > 0$ such that for all $\varvec{\beta}' \in \mathcal{B}$, there exists $\lambda(\varvec{\xi}, \varvec{\beta}') \ge 0$ such that the mapping
     $$\begin{aligned} \varvec{\beta} \mapsto H(\varvec{\xi}, \varvec{\beta}) + \frac{\lambda(\varvec{\xi}, \varvec{\beta}')}{2}\Vert \varvec{\beta} - \varvec{\beta}_0\Vert^2 \end{aligned}$$
     is convex on the set $\{\varvec{\beta}: \Vert \varvec{\beta} - \varvec{\beta}'\Vert \le \epsilon_0\}$ for any $\varvec{\beta}_0$, and $\mathbb{E}[\lambda(\varvec{\xi}, \varvec{\beta}') \mid \varvec{\beta}] < \infty$.
  4. H4. The stochastic gradient $\mathbf{G}_{\varvec{\beta}^{(t-1)}}(\varvec{\xi}^{(t)})$ is a Monte Carlo approximation of $\nabla h(\varvec{\beta}^{(t-1)})$. That is, if computationally feasible, we take $\varvec{\xi}^{(t)}$ as an exact sample from $\pi_{\varvec{\beta}^{(t-1)}}$, where, as defined earlier, $\pi_{\varvec{\beta}^{(t-1)}}$ is the posterior distribution of $\varvec{\xi}$ given $\varvec{y}$ and $\varvec{\beta}^{(t-1)}$. If not, we sample $\varvec{\xi}^{(t)}$ from a Markov kernel $P_{\varvec{\beta}^{(t-1)}}$ with invariant distribution $\pi_{\varvec{\beta}^{(t-1)}}$.

  5. H5. Define
     (20) $$\begin{aligned} \begin{aligned} &\varvec{\beta}_{\gamma}^{+}(\varvec{\xi}) = \mathop{\text{argmin}}_{\varvec{x} \in \mathcal{B}} \left\{ [\mathbf{G}_{\varvec{\beta}}(\varvec{\xi})]^{\top}(\varvec{x} - \varvec{\beta}) + g(\varvec{x}) + \frac{1}{2\gamma}\Vert \varvec{x} - \varvec{\beta}\Vert^{2}_{\mathbf{D}} \right\},\\ &\varvec{U}_{\gamma}(\varvec{\xi}; \varvec{\beta}) = \frac{1}{\gamma}(\varvec{\beta} - \varvec{\beta}_{\gamma}^{+}(\varvec{\xi})), \end{aligned} \end{aligned}$$
     where the step sizes satisfy $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$. Then, with probability 1,
     $$\begin{aligned} \lim_{n\rightarrow \infty} \sum_{t=1}^{n} \gamma_t \left( \varvec{U}_{\gamma_t}(\varvec{\xi}^{(t)}; \varvec{\beta}^{(t-1)}) - \mathbb{E}[\varvec{U}_{\gamma_t}(\varvec{\xi}^{(t)}; \varvec{\beta}^{(t-1)}) \mid \varvec{\beta}^{(t-1)}] \right) \end{aligned}$$
     exists and is finite.

We remark that conditions H1 through H5 are quite mild. Condition H1 imposes mild requirements on the compactness of the parameter space and the properties of the stationary points of the objective function. Specifically, the compactness of the parameter space is often assumed when analyzing stochastic optimization problems without assuming convexity; see, e.g., Gu & Kong (1998), Nielsen (2000), Cai (2010b), and Duchi & Ruan (2018). It also requires that the objective function takes different values at different stationary points. Conditions H2 and H3 require the complete-data log-likelihood function $H(\varvec{\xi}, \cdot)$ to be locally Lipschitz and weakly convex, respectively. These conditions hold when the complete-data log-likelihood function $H(\varvec{\xi}, \cdot)$ is Lipschitz and convex on the entire parameter space. Requiring only local Lipschitz continuity and weak convexity allows our theory to apply to a wider range of problems. Similar conditions are imposed in Duchi & Ruan (2018). For the examples that we consider in Sects. 1.2 and 1.3, these two conditions are satisfied because $H(\varvec{\xi}, \varvec{\beta})$ is smooth and convex in $\varvec{\beta}$. Condition H4 is automatically satisfied by the way the latent variables are sampled in Algorithm 1. Finally, H5 is a key condition for the convergence of the sequence $\varvec{\beta}^{(t)}$. When exact samples from the posterior distribution are used, Lemma 1 guarantees that H5 is satisfied. With approximate samples from an MCMC algorithm, H5 may still hold when the bias from the MCMC samples is small.

Lemma 1

Define the filtration of $\sigma$-algebras $\mathcal{F}_{t-1} = \sigma\left(\varvec{\beta}^{(0)}, \varvec{\xi}^{(k)}, 0 \le k \le t-1\right)$, and suppose that $\varvec{\xi}$ is a sample from $\pi_{\varvec{\beta}}$. Let

$$\begin{aligned} \varvec{\epsilon}_{\gamma}(\varvec{\xi}; \varvec{\beta}) = \varvec{U}_{\gamma}(\varvec{\xi}; \varvec{\beta}) - \mathbb{E}[\varvec{U}_{\gamma}(\varvec{\xi}; \varvec{\beta}) \mid \varvec{\beta}], \end{aligned}$$

then $\gamma_t \varvec{\epsilon}_{\gamma_t}(\varvec{\xi}^{(t)}, \varvec{\beta}^{(t-1)})$ is a square-integrable martingale difference sequence adapted to $\mathcal{F}_{t-1}$, and with probability 1, $\lim_{n} \sum_{t=1}^{n} \gamma_t \varvec{\epsilon}_{\gamma_t}(\varvec{\xi}^{(t)}, \varvec{\beta}^{(t-1)})$ exists and is finite.

Theorem 1

Apply Algorithm 1 to the optimization problem (11) with step size $\gamma_t = t^{-\frac{1}{2}-\epsilon}$, $\epsilon \in (0, \frac{1}{2}]$, and suppose that conditions H1–H5 hold. Then, with probability 1, the sequence $\varvec{\beta}^{(n)}$ converges to a stationary point in $\mathcal{B}^{*}$.

We remark that the convergence result for the proposed method is similar to that of the EM algorithm. In fact, for marginal maximum likelihood estimation, which is non-convex, the EM algorithm also only guarantees convergence to a stationary point (Wu, 1983). Moreover, when the objective function has a single stationary point (e.g., when the objective function is strictly convex), Theorem 1 guarantees global convergence.

The convergence of $\varvec{\beta}^{(n)}$ guarantees the convergence of the Polyak–Ruppert averaged sequence $\bar{\varvec{\beta}}_n$. However, Theorem 1 does not provide information about the convergence speed. In what follows, we establish the convergence speed of $\bar{\varvec{\beta}}_n$. Without loss of generality, by Theorem 1, we assume that $\varvec{\beta}^{(n)}$ converges to $\varvec{\beta}_{*} \in \mathcal{B}^{*}$.

  1. H6. There exists $\delta > 0$, such that $h(\boldsymbol{\beta})$ is strongly convex in $\mathcal{B}_1 = \{\boldsymbol{\beta} \in \mathcal{B}: \Vert\boldsymbol{\beta} - \boldsymbol{\beta}_*\Vert \le \delta\}$ and $\nabla h(\boldsymbol{\beta})$ is Lipschitz in $\mathcal{B}_1$ with Lipschitz constant $L$.

  2. H7. For $\boldsymbol{\beta}, \boldsymbol{\beta}' \in \mathcal{B}_1$, any $\gamma > 0$, and any diagonal matrix $\mathbf{D}$ with diagonal entries $\delta_i \in [c_1, c_2]$, the following conditions hold.

    1. (i) $g\left(\text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta})\right) - g(\boldsymbol{\beta}') \le -\frac{1}{\gamma}\left\langle \text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}) - \boldsymbol{\beta}',\; \text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}) - \boldsymbol{\beta}\right\rangle_{\mathbf{D}}$.

    2. (ii) $\left\Vert \text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}) - \text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}')\right\Vert_{\mathbf{D}}^{2} + \left\Vert \left(\text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}) - \boldsymbol{\beta}\right) - \left(\text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}') - \boldsymbol{\beta}'\right)\right\Vert_{\mathbf{D}}^{2} \le \left\Vert \boldsymbol{\beta} - \boldsymbol{\beta}'\right\Vert_{\mathbf{D}}^{2}$.

    3. (iii) $\sup_{\gamma \in (0, c_1/L]}\, \sup_{\boldsymbol{\beta} \in \mathcal{B}_1}\, \gamma^{-1}\left\Vert \text{Prox}^{\mathbf{D}}_{\gamma, g}(\boldsymbol{\beta}) - \boldsymbol{\beta}\right\Vert < \infty$.

  3. H8. For a measurable function $V: \Xi \rightarrow [1, +\infty)$, a signed measure $\mu$ on the $\sigma$-field of $\Xi$, and a function $f: \Xi \rightarrow \mathbb{R}$, define

    $$|f|_{V} \overset{\text{def}}{=} \sup_{\boldsymbol{\xi} \in \Xi} \frac{|f(\boldsymbol{\xi})|}{V(\boldsymbol{\xi})}, \quad \Vert\mu\Vert_{V} \overset{\text{def}}{=} \sup_{f,\,|f|_{V} \le 1}\left| \int f \,\mathrm{d}\mu \right|.$$
    There exist $\lambda \in (0, 1)$, $b < \infty$, $m \ge 4$, and a measurable function $W: \Xi \rightarrow [1, +\infty)$ such that
    $$\sup_{\boldsymbol{\beta} \in \mathcal{B}_1}\left| \mathbf{G}_{\boldsymbol{\beta}}\right|_{W} < \infty, \quad \sup_{\boldsymbol{\beta} \in \mathcal{B}_1} P_{\boldsymbol{\beta}} W^{m} \le \lambda W^{m} + b,$$
    where $\mathbf{G}_{\boldsymbol{\beta}}(\boldsymbol{\xi}) = \partial H(\boldsymbol{\xi}, \boldsymbol{\beta})/\partial\boldsymbol{\beta}$ and $P_{\boldsymbol{\beta}}$ is the Markov kernel defined in condition H4. In addition, for any $\ell \in (0, m]$, there exist $C < \infty$ and $\rho \in (0, 1)$ such that for any $\boldsymbol{\xi} \in \Xi$,
    $$\sup_{\boldsymbol{\beta} \in \mathcal{B}_1}\left\Vert P_{\boldsymbol{\beta}}^{n}(\boldsymbol{\xi}, \cdot) - \pi_{\boldsymbol{\beta}}\right\Vert_{W^{\ell}} \le C \rho^{n} W^{\ell}(\boldsymbol{\xi}).$$
  4. H9. There exists a constant $C$ such that for any $\boldsymbol{\beta}, \boldsymbol{\beta}' \in \mathcal{B}_1$,

    $$\left| \mathbf{G}_{\boldsymbol{\beta}} - \mathbf{G}_{\boldsymbol{\beta}'}\right|_{W} + \sup_{\boldsymbol{\xi} \in \Xi} \frac{\left\Vert P_{\boldsymbol{\beta}}(\boldsymbol{\xi}, \cdot) - P_{\boldsymbol{\beta}'}(\boldsymbol{\xi}, \cdot)\right\Vert_{W}}{W(\boldsymbol{\xi})} + \left\Vert \pi_{\boldsymbol{\beta}} - \pi_{\boldsymbol{\beta}'}\right\Vert_{W} \le C\left\Vert \boldsymbol{\beta} - \boldsymbol{\beta}'\right\Vert.$$

We provide a few remarks on conditions H6–H9, which are needed for establishing the convergence speed in addition to conditions H1–H5. Condition H6 requires that the smooth part of the objective function is strongly convex and that its derivative is Lipschitz continuous in a small neighborhood of $\boldsymbol{\beta}_*$. Specifically, $h(\boldsymbol{\beta})$ being strongly convex in $\mathcal{B}_1$ means that there exists a positive constant $C$ such that $(\nabla h(\boldsymbol{\beta}) - \nabla h(\boldsymbol{\beta}'))^\top (\boldsymbol{\beta} - \boldsymbol{\beta}') \ge C\Vert\boldsymbol{\beta} - \boldsymbol{\beta}'\Vert^2$ for any $\boldsymbol{\beta}, \boldsymbol{\beta}' \in \mathcal{B}_1$. Condition H7 imposes some requirements on the non-smooth part of the objective function, through the proximal operator. As verified in Lemma C.1, H7 holds when $g$ is a generalized function indicating constraints, or when $g$ is locally Lipschitz continuous and convex, which is the case when $g$ is an $L_1$ regularization function. Thus, H7 holds for the examples we consider in Sects. 2.2 and 2.3. Conditions H8 and H9 impose mild regularity conditions on the stochastic gradient in a local neighborhood of $\boldsymbol{\beta}_*$, especially when the stochastic gradients are generated by a Markov kernel. These conditions are used to control the bias caused by MCMC sampling. H8 is essentially a uniform-in-$\boldsymbol{\beta}$ ergodic condition, and H9 is a local Lipschitz condition on the Markov kernel. These regularity conditions are commonly adopted in the stochastic approximation literature (Benveniste et al., 1990; Andrieu et al., 2005; Fort et al., 2016) and have been shown to hold for general families of MCMC kernels, including Metropolis–Hastings and Gibbs samplers (Andrieu & Moulines, 2006; Fort et al., 2011; Schmidt et al., 2011).
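To make the proximal operator appearing in H7 concrete, the following sketch evaluates the diagonally scaled proximal map $\text{Prox}^{\mathbf{D}}_{\gamma, g}$ for the $L_1$ penalty case mentioned above, where it reduces to component-wise soft-thresholding. This is a minimal Python illustration of ours; the function and variable names are not from the paper.

```python
import numpy as np

def prox_l1_scaled(beta, gamma, lam, d):
    """Diagonally scaled proximal map of g(x) = lam * ||x||_1:
    argmin_x  lam * ||x||_1 + (1 / (2 * gamma)) * (x - beta)' D (x - beta),
    with D = diag(d) and d_i in [c1, c2].  The solution is component-wise
    soft-thresholding with thresholds gamma * lam / d_i."""
    thresh = gamma * lam / d
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)

# With d = 1 the map reduces to ordinary soft-thresholding.
beta = np.array([0.8, -0.2, 0.05])
print(prox_l1_scaled(beta, gamma=0.5, lam=0.3, d=np.ones(3)))
```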

Theorem 2

Suppose that H1–H9 hold. Then there exists a constant $C$, such that for the Polyak–Ruppert averaging sequence $\bar{\boldsymbol{\beta}}_n = \frac{1}{n}\sum_{t=1}^n \boldsymbol{\beta}^{(t)}$ from Algorithm 1,

(21) $$\mathbb{E}\Vert\bar{\boldsymbol{\beta}}_n - \boldsymbol{\beta}_*\Vert^2 \le C n^{-\frac{1}{2} + \varepsilon}.$$

Note that the expectation is taken with respect to $\boldsymbol{\xi}^{(1)}, \ldots, \boldsymbol{\xi}^{(t)}$ given $\boldsymbol{\beta}^{(0)}$ and $\boldsymbol{\xi}^{(0)}$.

We now provide a few remarks regarding the convergence speed (21). First, the small positive constant $\varepsilon$ comes from the requirement on the step size that $\sum_{t=1}^\infty \gamma_t^2 < \infty$ in H5. Since $\sum_{t=1}^\infty \gamma_t^2 < \infty$ is satisfied when $\gamma_t = \mu t^{-\frac{1}{2} - \varepsilon}$ for any $\mu, \varepsilon > 0$, the convergence speed of $\mathbb{E}\Vert\bar{\boldsymbol{\beta}}_n - \boldsymbol{\beta}_*\Vert^2$ can be made arbitrarily close to $O(n^{-\frac{1}{2}})$ by choosing an arbitrarily small $\varepsilon$. Second, this $\varepsilon$ might be an artifact of our proof strategy for overcoming the non-convexity of the problem. In fact, if the objective function is convex, then similar to Atchadé et al. (2017), we can choose $\varepsilon = 0$ and prove under similar conditions that $\mathbb{E}\Vert\bar{\boldsymbol{\beta}}_n - \boldsymbol{\beta}_*\Vert^2 \le C n^{-\frac{1}{2}}$. Lastly, it is well known that for non-smooth convex optimization, the minimax optimal convergence rate is $O(n^{-\frac{1}{2}})$; see Chapter 3 of Nesterov (2004). In this sense, our algorithm is almost minimax optimal when $\varepsilon$ is very close to zero. It is also well known that Polyak–Ruppert averaging typically improves the convergence speed of a slowly convergent sequence (Ruppert, 1988; Bonnabel, 2013).
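The following toy sketch, written by us purely for illustration, mimics the recursion analyzed above on a simple quadratic-plus-$L_1$ objective: a noisy proximal-gradient update with step size $\gamma_t = t^{-0.51}$ is run, and the error of the last iterate is compared with that of the Polyak–Ruppert average. The objective, noise level, and constants are arbitrary choices of ours, not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft(x, t):                       # proximal map of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

b, lam = np.array([1.0, -0.5, 0.05]), 0.1
beta_opt = soft(b, lam)               # exact minimizer of 0.5*||x - b||^2 + lam*||x||_1

beta = np.zeros(3)
avg = np.zeros(3)
for t in range(1, 2001):
    gamma = t ** (-0.51)              # step size gamma_t = t^(-1/2 - eps) with eps = 0.01
    noisy_grad = (beta - b) + 0.3 * rng.standard_normal(3)   # unbiased noisy gradient of h
    beta = soft(beta - gamma * noisy_grad, gamma * lam)      # stochastic proximal update
    avg += (beta - avg) / t           # running Polyak-Ruppert average of beta^(1), ..., beta^(t)

print("error of last iterate:        ", np.linalg.norm(beta - beta_opt))
print("error of Polyak-Ruppert mean: ", np.linalg.norm(avg - beta_opt))
```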

4. Simulation Study

4.1. Study I: Confirmatory IFA

Table 1. Comparison of five stochastic algorithms.

In the first study, we compare the performance of four variants of the proposed method and the stochastic EM (StEM) algorithm. The five methods, including their abbreviations, are given in Table 1. For a fair comparison, the same Gibbs sampling method is used. We further explain the differences below.

  1. USP is the method that we recommend. It has a step size $\gamma_t$ close to $t^{-1/2}$, applies Polyak–Ruppert averaging, and uses a quasi-Newton update in the proximal step.

  2. The USP-PPG method is the perturbed proximal gradient method, implemented the same as the USP method except that $c_1 = c_2$, so that it does not involve a quasi-Newton update. $c_1$ is set to 1 without tuning in this study.

  3. The USP-RM1 method is implemented the same as the USP method, except that $\boldsymbol{\beta}^{(n)}$ from the last iteration is taken as the estimator instead of applying Polyak–Ruppert averaging. This method is very similar to a Robbins–Monro algorithm, except for the update of the parameters $\mathbf{B}$ for the covariance matrix, where constraints are involved.

  4. The USP-RM2 method is the same as USP-RM1, except that we set the step size $\gamma_t = 1/t$, which is the asymptotically optimal step size for the Robbins–Monro algorithm (Fabian, 1968).

  5. The implementation of the StEM algorithm is the same as USP, except for the proximal step. Instead of making a stochastic gradient update, StEM obtains $\boldsymbol{\beta}^{(t)}$ by completely solving the optimization problem

    $$\boldsymbol{\beta}^{(t)} = \mathop{\arg\max}_{\boldsymbol{\beta} \in \mathcal{B}}\; H(\boldsymbol{\xi}^{(t)}, \boldsymbol{\beta}) + g(\boldsymbol{\beta}).$$
    In our implementation, this optimization problem is solved by applying the quasi-Newton proximal update (14) iteratively until convergence. A schematic comparison of the one-step USP update and the StEM inner loop is sketched after this list.
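The sketch below, a simplified toy illustration of ours, contrasts the two update rules: USP performs a single proximal-gradient step per iteration (here with $\mathbf{D} = \mathbf{I}$, so the quasi-Newton scaling is suppressed), while StEM iterates the same step until the inner penalized problem is numerically solved. The toy complete-data objective $H(\boldsymbol{\xi}, \boldsymbol{\beta}) = -\tfrac{1}{2}\Vert\boldsymbol{\beta} - \boldsymbol{\xi}\Vert^2$ and the $L_1$ penalty are our own choices, not the models used in the study.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def usp_step(beta, xi, gamma, lam):
    """One USP-style update: a single proximal-gradient step on the
    complete-data term H(xi, beta) = -0.5*||beta - xi||^2, followed by the
    proximal map of the L1 penalty (D = I here, so no quasi-Newton scaling)."""
    grad = xi - beta                              # gradient of H in beta
    return soft(beta + gamma * grad, gamma * lam)

def stem_step(beta, xi, gamma, lam, n_inner=200):
    """StEM-style update: iterate the same proximal step until the inner
    problem  max_beta H(xi, beta) - lam*||beta||_1  is (numerically) solved."""
    for _ in range(n_inner):
        beta = soft(beta + gamma * (xi - beta), gamma * lam)
    return beta

xi = np.array([1.2, -0.3, 0.02])                  # one draw of the latent variables
print(usp_step(np.zeros(3), xi, gamma=0.5, lam=0.1))
print(stem_step(np.zeros(3), xi, gamma=0.5, lam=0.1))   # approx. soft(xi, 0.1), the inner maximizer
```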

We consider a confirmatory IFA setting with only two factors (i.e., $K = 2$), so that an EM algorithm with a sufficient number of quadrature points and EM steps can be used to obtain a more accurate approximation of $\hat{\boldsymbol{\beta}}$, which will be used as the standard when comparing the five methods. We emphasize that it is important to compare the convergence speed of different algorithms based on $\hat{\boldsymbol{\beta}}$ rather than the true model parameters. This is because, under suitable conditions, these algorithms converge to $\hat{\boldsymbol{\beta}}$ rather than the true model parameters. If we compare the algorithms based on the true model parameters, the difference in the convergence speed cannot be observed clearly, as the statistical error (i.e., the difference between $\hat{\boldsymbol{\beta}}$ and the true model parameters) tends to dominate the computational error (i.e., the difference between $\hat{\boldsymbol{\beta}}$ and the results given by the stochastic algorithms).

More precisely, we consider sample size $N = 1000$ and $J = 20$ items. The design matrix $\mathbf{Q}$ is specified so that items 1 through 5 only measure the first factor, items 6 through 10 only measure the second factor, and items 11 through 20 measure both. The intercept parameters $d_j$ are drawn i.i.d. from the standard normal distribution, and the non-zero loading parameters are drawn i.i.d. from a uniform distribution over the interval (0.5, 1.5). The variances of the two factors are set to 1 and the covariance is set to 0.4. Under these parameters, 100 independent datasets are generated, based on which the five methods are compared. To ensure a fair comparison, the true parameters are used as the starting point for all the methods. In addition, 1000 iterations are run (i.e., $n = 1000$) for each method, instead of using an adaptive stopping criterion. For USP, USP-PPG, and StEM, the burn-in size $\varpi$ is chosen to be 500. All algorithms are implemented in C++ and run on the same platform (Footnote 1) using a single core.
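For concreteness, a data-generation sketch for this setting follows. It mirrors the description above and assumes the standard logistic (two-parameter) item response function for IFA; the random seed and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, K = 1000, 20, 2

# Design matrix Q: items 1-5 load on factor 1, items 6-10 on factor 2, items 11-20 on both.
Q = np.zeros((J, K), dtype=int)
Q[:5, 0] = 1
Q[5:10, 1] = 1
Q[10:, :] = 1

d = rng.standard_normal(J)                        # intercepts d_j ~ N(0, 1)
A = Q * rng.uniform(0.5, 1.5, size=(J, K))        # non-zero loadings ~ U(0.5, 1.5)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])                    # factor covariance matrix

theta = rng.multivariate_normal(np.zeros(K), Sigma, size=N)   # latent factor scores
prob = 1.0 / (1.0 + np.exp(-(theta @ A.T + d)))   # logistic item response probabilities
Y = (rng.uniform(size=(N, J)) < prob).astype(int) # one simulated binary response dataset
```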

Figure 1. The boxplot of mean squared errors for estimated parameters from the five methods.

Figure 2. The boxplot of mean squared errors for estimated parameters from the 'USP', 'USP-RM1', and 'StEM' methods.

The results regarding the accuracy of the proposed methods are given in Figs. 1 and 2, which are based on the following performance metrics. Specifically, for the intercept parameters $d_j$, the following mean squared error (MSE) is calculated for each simulated dataset and each method,

$$\frac{1}{J}\sum_{j=1}^{J} \left( \tilde{d}_j - \hat{d}_j\right)^2,$$

where $\hat{d}_j$, which is treated as the global optimum, is obtained by an EM algorithm with 31 Gaussian–Hermite quadrature points per dimension, and $\tilde{d}_j$ is given by one of the five stochastic methods after 1000 iterations. Similarly, the MSEs for the loading parameters and for the correlation $\sigma_{12}$ between the factors are calculated, where the MSE for the loading parameters is calculated over the unrestricted ones only, i.e.,

$$\frac{\sum_{j,k} 1_{\{q_{jk}\ne 0\}} (\tilde{a}_{jk} - \hat{a}_{jk})^2}{\sum_{j,k} 1_{\{q_{jk}\ne 0\}}}.$$

Again, $\hat{a}_{jk}$ is given by the EM algorithm, and $\tilde{a}_{jk}$ is given by one of the five methods.
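A direct translation of these two error metrics is sketched below; the function names are ours, and the inputs are assumed to be NumPy arrays holding the EM-based and stochastic estimates.

```python
import numpy as np

def mse_intercepts(d_tilde, d_hat):
    """MSE between a stochastic estimate and the EM-based 'gold standard' intercepts."""
    return np.mean((d_tilde - d_hat) ** 2)

def mse_loadings(A_tilde, A_hat, Q):
    """MSE over the unrestricted loadings only (entries with q_jk != 0)."""
    mask = Q != 0
    return np.mean((A_tilde[mask] - A_hat[mask]) ** 2)
```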

Figure 1 compares the accuracy of all five methods. As we can see, the USP, USP-RM1, and StEM methods have much smaller MSEs than the USP-PPG and USP-RM2 methods. Since the USP-PPG method only differs from the USP method in whether a quasi-Newton update is used, the inferior performance of USP-PPG implies the importance of the second-order information in the stochastic proximal gradient update. As the USP-RM2 method only differs from USP-RM1 in the step size, the inferior performance of USP-RM2 is mainly due to the use of a shorter step size.

In Fig. 2, we zoom in to further compare the USP, USP-RM1, and StEM methods. First, we see that the USP method performs the best among the three, for all the parameters. As the USP-RM1 method is the same as the USP method except for not applying Polyak–Ruppert averaging, this result suggests that averaging does improve accuracy. Moreover, the USP method and the StEM method only differ in the way the parameters are updated: the USP method takes a quasi-Newton proximal update, while the StEM method completely solves an optimization problem. It is likely that the way parameters are updated in the USP method yields more smoothing (i.e., averaging) than in StEM, which explains why the USP method performs better.

Table 2. The elapsed time (seconds) for the five methods in confirmatory IFA.

Regarding computational efficiency, Table 2 shows the elapsed time for the five methods. 'USP', 'USP-RM1', 'USP-PPG', and 'USP-RM2' share similar computation time, since their numbers of floating point operations per iteration are at the same level. 'StEM' is the most time-consuming, because an inner optimization loop is involved in each iteration. In summary, the proposed USP algorithm is computationally the most efficient among the five algorithms, in the sense that it achieves the highest accuracy (see Figs. 1 and 2) within a similar or smaller amount of time (see Table 2).

4.2. Study II: Exploratory IFA by Regularization

In the second study, we apply the proposed method to regularized estimation for exploratory IFA as discussed in Sect. 1.2. We consider increasing sample sizes $N = 1000, 2000, 4000$, eighty items, and five correlated latent factors (i.e., $J = 80$, $K = 5$). The true loading matrix is sparse, and the items that each factor loads on are given in Table 3. Similar to Study I, the intercept parameters $d_j$ are drawn i.i.d. from the standard normal distribution, and the non-zero loading parameters $a_{jk}$ are drawn i.i.d. from a uniform distribution over the interval (0.5, 1.5). The entries of the covariance matrix $\Sigma = (\sigma_{k,k'})_{5\times 5}$ are set to $\sigma_{k,k'} = 1$ for $k = k'$ and $\sigma_{k,k'} = 0.4$ for $k \ne k'$.

Table 3. The sparse loading structure in the data-generating IFA model.

For each sample size, 50 independent datasets are generated. In the proposed algorithm, we adopt a burn-in size $\varpi = 50$ and stop based on the criterion discussed in Sect. 2, where the stopping threshold is set to $10^{-3}$. A decreasing penalty parameter $\lambda_N = \sqrt{\log J / N}$ is used to ensure estimation consistency (Chapter 6, Bühlmann & van de Geer, 2011). Other implementation details can be found in Sect. 2.2. The algorithm in this example is implemented in C++ and is run on the same platform as in Study I. Although a regularized EM algorithm (Sun et al., 2016) can also solve this problem, it suffers from a very high computational cost. Due to the five-dimensional numerical integrals involved, it takes a few hours to fit one dataset. We thus do not consider it here.

We focus on the accuracy in the estimation of the loading matrix $\mathbf{A} = (a_{jk})_{J\times K}$. Note that although the rotational indeterminacy issue is resolved by this regularized estimator, the loading matrix can still only be identified up to column swapping. That is, two estimates of the loading matrix have the same objective function value if one can be obtained by swapping the columns of the other. The following mean-squared-error measure, which takes into account column swapping of the loading matrix, is therefore used:

(22) $$\min_{\mathbf{A}' \in \mathcal{P}(\tilde{\mathbf{A}})}\left\{ \frac{1}{JK}\Vert \mathbf{A}' - \mathbf{A}\Vert_F^2\right\},$$

where $\Vert\cdot\Vert_F$ is the Frobenius norm, $\mathbf{A}$ is the true loading matrix, $\tilde{\mathbf{A}}$ is the output of Algorithm 1, and $\mathcal{P}(\tilde{\mathbf{A}})$ denotes the set of $J\times K$ matrices that can be obtained by swapping the columns of $\tilde{\mathbf{A}}$.
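Criterion (22) can be evaluated by enumerating the $K!$ column permutations, which is cheap for $K = 5$. A minimal sketch of ours:

```python
import numpy as np
from itertools import permutations

def mse_up_to_column_swap(A_tilde, A_true):
    """Criterion (22): the smallest Frobenius-norm MSE over all column
    permutations of the estimated loading matrix A_tilde."""
    J, K = A_true.shape
    best = np.inf
    for perm in permutations(range(K)):
        err = np.sum((A_tilde[:, list(perm)] - A_true) ** 2) / (J * K)
        best = min(best, err)
    return best
```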

Results are given in Tables 4 and 5. In Table 4, we see that the MSE for the loading matrix is quite small and decreases as the sample size grows, suggesting consistency of the regularized estimator. In Table 5, the quantiles of the computation time under different sample sizes are given, which suggests the computational efficiency of the proposed method.

Table 4. The mean squared errors for estimated loading parameters in exploratory IFA with $L_1$ regularization.

Table 5. The elapsed time (seconds) for exploratory IFA with $L_1$ regularization.

4.3. Study III: Restricted LCA

In this study, we apply the proposed method to the estimation of a restricted latent class model as discussed in Sect. 1.3, where the optimization involves complex inequality constraints. Specifically, data are generated from a Deterministic Input, Noisy 'And' gate (DINA) model (Junker & Sijtsma, 2001), which is a special restricted latent class model. Note that the DINA assumptions are only used in the data generation. We solve optimization problem (10), which is based on the general restricted latent class model considered in Xu (2017) rather than the DINA model, mimicking the practical situation in which the parametric form is unknown.

We consider a test consisting of twenty items (i.e., $J = 20$) that measure four binary attributes (i.e., $K = 4$). Three sample sizes are considered: $N = 1000$, 2000, and 4000. The design matrix $\mathbf{Q}$ is given in Table 6. In addition, the slipping and guessing parameters $s_j$ and $g_j$ of the DINA model are drawn i.i.d. from a uniform distribution over the interval (0.05, 0.2), which gives the values of $\theta_{j,\boldsymbol{\alpha}}$. That is,

$$\theta_{j,\boldsymbol{\alpha}} = \begin{cases} \log\big((1-s_j)/s_j\big), & \text{if } \boldsymbol{\alpha} \succeq \mathbf{q}_j, \\ \log\big(g_j/(1-g_j)\big), & \text{otherwise.} \end{cases}$$

Finally, we let $\nu_{\boldsymbol{\alpha}} = 0$ for all $\boldsymbol{\alpha} \in \{0, 1\}^K$, so that $\mathbb{P}(\boldsymbol{\xi} = \boldsymbol{\alpha}) = 1/2^K$. According to the results in Xu (2017), the model parameters are identifiable, given the $\mathbf{Q}$-matrix in Table 6.
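The construction of $\theta_{j,\boldsymbol{\alpha}}$ and $\nu_{\boldsymbol{\alpha}}$ described above can be sketched as follows. Since Table 6 is not reproduced in this sketch, a randomly generated binary $\mathbf{Q}$ is used as a stand-in; the seed and variable names are our own.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
J, K = 20, 4
Q = rng.integers(0, 2, size=(J, K))            # stand-in for the Q-matrix in Table 6

s = rng.uniform(0.05, 0.2, size=J)             # slipping parameters s_j
g = rng.uniform(0.05, 0.2, size=J)             # guessing parameters g_j

patterns = np.array(list(product([0, 1], repeat=K)))   # all 2^K attribute profiles alpha

theta = np.empty((J, len(patterns)))
for j in range(J):
    for a, alpha in enumerate(patterns):
        mastered = np.all(alpha >= Q[j])       # alpha >= q_j component-wise
        theta[j, a] = np.log((1 - s[j]) / s[j]) if mastered else np.log(g[j] / (1 - g[j]))

nu = np.zeros(len(patterns))                   # nu_alpha = 0, so P(xi = alpha) = 1 / 2^K
```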

Table 6. The design matrix $\mathbf{Q}$ for the restricted LCA model.

For each sample size, 50 independent datasets are generated. The proposed algorithm adopts a burn-in size $\varpi = 50$ and stops based on the criterion discussed in Sect. 2, where the stopping threshold is set to $10^{-3}$. Other implementation details can be found in Sect. 2.3. The following metrics are used to evaluate the estimation accuracy. For the item parameters $\theta_{j,\boldsymbol{\alpha}}$, the MSE is calculated as

$$\frac{1}{J\times 2^K}\sum_{j=1}^J\sum_{\boldsymbol{\alpha}\in \{0,1\}^K}\left( \tilde{\theta}_{j,\boldsymbol{\alpha}} - \theta_{j,\boldsymbol{\alpha}}\right)^2.$$

For the structural parameters $\nu_{\boldsymbol{\alpha}}$, the MSE is calculated as

$$\frac{1}{2^K-1}\sum_{\boldsymbol{\alpha}\in \{0,1\}^K,\ \boldsymbol{\alpha}\ne \mathbf{0}}\left( \tilde{\nu}_{\boldsymbol{\alpha}} - \nu_{\boldsymbol{\alpha}}\right)^2.$$

Our results are given in Tables 7 and 8. As we can see, the estimation becomes more accurate as the sample size increases, for both sets of parameters. This confirms that the current model is identifiable, as suggested by Xu (2017), and thus can be consistently estimated.

Table 7. The MSE for item parameters $\theta_{j,\boldsymbol{\alpha}}$ in the restricted latent class model.

Table 8. The MSE for structural parameters $\nu_{\boldsymbol{\alpha}}$ in the restricted latent class model.

5. Concluding Remarks

In this paper, a unified stochastic proximal optimization framework is proposed for the computation of latent variable model estimation. This framework is very general and applies to a wide range of estimators for almost all commonly used latent variable models. Compared with existing stochastic optimization methods, the proposed method not only solves a wider range of problems, including regularized and constrained estimators, but is also computationally more efficient. Theoretical properties of the proposed method are established. These results suggest that the convergence speed of the proposed method is almost optimal in the minimax sense.

The power of the proposed method is shown via three examples, including confirmatory IFA, exploratory IFA by regularized estimation, and restricted latent class analysis. Specifically, the proposed method is compared with several stochastic optimization algorithms, including a stochastic EM algorithm and a Robbins–Monro algorithm with MCMC sampling, in the simulation study of confirmatory IFA, where there is no complex constraint or penalty. Using the same starting point and the same number of iterations, the proposed method is always more accurate than its competitors. The simulation studies on exploratory IFA and restricted latent class analysis further show the power of the proposed method for handling optimization problems with non-smooth penalties and complex inequality constraints.

The implementation of the proposed algorithm involves several tuning parameters. First, we need to choose a step size $\gamma_t$. Our theoretical results suggest $\gamma_t = t^{-0.5-\epsilon}$ for any $\epsilon \in (0, 0.5]$, with a smaller $\epsilon$ leading to faster convergence. In practice, we suggest setting $\gamma_t = t^{-0.51}$, which performs well in all our simulations. This choice of step size is very different from the choice of $\gamma_t = t^{-1}$ in the MCMC stochastic approximation algorithms. Second, a burn-in size $\varpi$ is needed. The burn-in in the proposed algorithm is similar to the burn-in in MCMC algorithms: it does not affect the asymptotic convergence of the algorithm but improves the finite-sample performance. In practice, the burn-in size can be decided as in MCMC algorithms, by monitoring the parameter updates with trace plots. Third, two positive constants $c_1$ and $c_2$ are needed to regularize the second-order matrix in the scaled proximal update. Depending on the scale of each particular problem, we suggest choosing $c_1$ to be sufficiently small and $c_2$ to be sufficiently large; the performance of our algorithm is not sensitive to these choices. Finally, a stopping criterion is needed. We suggest stopping the iterative update by monitoring a window of successive differences in the parameter updates.
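For concreteness, below is a minimal sketch (in Python) of how these tuning choices might be coded. The default values, the eigenvalue-clipping rule used to enforce the bounds $c_1$ and $c_2$ on the second-order matrix, and the window-based stopping rule are illustrative assumptions rather than exact prescriptions from our implementation.

```python
import numpy as np

def step_size(t, epsilon=0.01):
    """Step size gamma_t = t^(-0.5 - epsilon); epsilon = 0.01 gives the
    suggested default gamma_t = t^(-0.51)."""
    return t ** (-(0.5 + epsilon))

def regularize_second_order(H, c1=1e-3, c2=1e3):
    """One way to keep the second-order matrix well conditioned: clip its
    eigenvalues to the interval [c1, c2] (illustrative choice of bounds)."""
    H = (H + H.T) / 2.0                       # symmetrize first
    vals, vecs = np.linalg.eigh(H)
    vals = np.clip(vals, c1, c2)
    return vecs @ np.diag(vals) @ vecs.T

def should_stop(param_history, window=5, tol=1e-4):
    """Stop once every successive difference within the last `window`
    parameter updates falls below `tol`."""
    if len(param_history) < window + 1:
        return False
    recent = param_history[-(window + 1):]
    diffs = [np.max(np.abs(recent[i + 1] - recent[i])) for i in range(window)]
    return max(diffs) < tol
```

As a quick check, `step_size(100)` returns approximately 0.095, whereas a $t^{-1}$ schedule would give 0.01 at the same iteration, illustrating how much more slowly the suggested step size decays.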

The proposed framework may be improved in several respects that are left for future investigation. First, the sampling strategy in the stochastic step needs further study. Although in theory any reasonable MCMC sampler yields convergence of the algorithm, a good sampler leads to better finite-sample performance, and more sophisticated MCMC algorithms deserve investigation under the proposed framework. Second, methods for parallel and distributed computing need to be developed. Many steps of Algorithm 1 can be performed independently, which makes it possible to design parallel and/or distributed computing systems for solving large-scale and/or distributed versions of latent variable model estimation problems (e.g., fitting models for assessment data from online learning platforms and large-scale mental health records). Finally, the performance of the proposed method under other latent variable models needs to be investigated. For example, the proposed method can also be applied to latent stochastic process models (e.g., Chow et al., 2016; Chen & Zhang, 2020b) that are useful for analyzing intensive longitudinal data. These models bring additional challenges, as stochastic processes need to be sampled in the stochastic step of our algorithm.
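To make the parallelism point concrete, the following is a minimal sketch (not our implementation) of distributing the sampling work of the stochastic step across worker processes, assuming the latent variables of different observations can be sampled independently given the current parameter value. The function `sample_latent` is a hypothetical placeholder for the per-observation MCMC draw.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def sample_latent(y_i, beta, seed):
    """Placeholder for the per-observation sampler: draw the latent
    variables for one observation given the current parameters beta."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=beta.shape)   # stand-in for an actual MCMC draw

def parallel_stochastic_step(data, beta, n_workers=4, base_seed=0):
    """Sample latent variables for all observations in parallel; each
    observation is handled by an independent task."""
    # On platforms that spawn processes, call this under
    # `if __name__ == "__main__":`.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        draws = list(pool.map(sample_latent,
                              data,
                              [beta] * len(data),
                              range(base_seed, base_seed + len(data))))
    return np.stack(draws)
```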

In summary, the proposed method is computationally efficient, theoretically sound, and applicable to a broad range of latent variable model inference problems. Just as the EM algorithm is the standard tool for estimating low-dimensional latent variable models, we believe the proposed method has the potential to serve as a standard approach to the estimation of high-dimensional latent variable models.

Acknowledgements

We thank the editor, associate editor, and three anonymous referees for their careful review and valuable comments. Zhang’s research is partially supported by Shanghai Science and Technology Commission Science and Technology Project (22YF1411100).

Footnotes

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11336-022-09863-9.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 CPU: 2.6 GHz 6-Core Intel Core i7; RAM: 16 GB 2400 MHz DDR4.

References

Andersen, E. B. (1973). Conditional inference and models for measuring. Copenhagen, Denmark: Mentalhygiejnisk Forlag.
Andrieu, C., & Moulines, É. (2006). On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16, 1462–1505.
Andrieu, C., Moulines, É., & Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44, 283–312.
Atchadé, Y. F., Fort, G., & Moulines, E. (2017). On perturbed proximal gradient algorithms. The Journal of Machine Learning Research, 18, 310–342.
Bartholomew, D. J., Knott, M., & Moustaki, I. (2011). Latent variable models and factor analysis: A unified approach. Hoboken, NJ: Wiley.
Becker, S., & Le Cun, Y. (1988). Improving the convergence of back-propagation learning with second order methods. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 connectionist models summer school (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Béguin, A. A., & Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–561.
Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations. New York, NY: Springer.
Bertsekas, D. P., Gallager, R. G., & Humblet, P. (1992). Data networks. Upper Saddle River, NJ: Prentice-Hall International.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Boston, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bollen, K. A. (1989). Structural equations with latent variables. Hoboken, NJ: Wiley.
Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395–414.
Bonnabel, S. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58, 2217–2229.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge, UK: Cambridge University Press.
Bühlmann, P., & van de Geer, S. (2011). Statistics for high-dimensional data. Berlin, Heidelberg: Springer.
Cai, L. (2010). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75, 33–57.
Cai, L. (2010). Metropolis–Hastings Robbins–Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective. Boca Raton, FL: CRC Press.
Celeux, G. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73–82.
Chen, Y., Li, X., Liu, J., & Ying, Z. (2018). Robust measurement via a fused latent and graphical item response theory model. Psychometrika, 83, 538–562.
Chen, Y., Li, X., & Zhang, S. (2019). Joint maximum likelihood estimation for high-dimensional exploratory item factor analysis. Psychometrika, 84, 124–146.
Chen, Y., Li, X., & Zhang, S. (2020). Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115, 1756–1770.
Chen, Y., Liu, J., Xu, G., & Ying, Z. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110, 850–866.
Chen, Y., & Zhang, S. (2020a). Estimation methods for item factor analysis: An overview. In Y. Zhao & D. G. Chen (Eds.), Modern statistical methods for health research. New York, NY: Springer.
Chen, Y., & Zhang, S. (2020b). A latent Gaussian process model for analysing intensive longitudinal data. British Journal of Mathematical and Statistical Psychology, 73, 237–260.
Chow, S. M., Lu, Z., Sherwood, A., & Zhu, H. (2016). Fitting nonlinear ordinary differential equation models with random effects and unknown initial conditions using the stochastic approximation expectation-maximization (SAEM) algorithm. Psychometrika, 81, 102–134.
Clogg, C. C. (1995). Latent class models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 311–359). Boston, MA: Springer.
de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76, 179–199.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39, 1–38.
Duchi, J. C., & Ruan, F. (2018). Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28, 3229–3259.
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62, 355–366.
Dunson, D. B. (2003). Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association, 98, 555–563.
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75, 474–497.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39, 1327–1332.
Floudas, C. A. (1995). Nonlinear and mixed-integer optimization: Fundamentals and applications. Oxford, UK: Oxford University Press.
Fort, G., Moulines, E., & Priouret, P. (2011). Convergence of adaptive and interacting Markov chain Monte Carlo algorithms. Annals of Statistics, 39, 3262–3289.
Fort, G., Moulines, É., Schreck, A., & Vihola, M. (2016). Convergence of Markovian stochastic approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54, 866–893.
Friedman, J., Hastie, T., & Tibshirani, R. The elements of statistical learning. New York, NY: Springer.
Ghosh, M. (1995). Inconsistent maximum likelihood estimators for the Rasch model. Statistics and Probability Letters, 23, 165–170.
Gilks, W. R., & Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41, 337–348.
Goldfarb, D., & Idnani, A. (1983). A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27, 1–33.
Gu, M. G., & Kong, F. H. (1998). A stochastic approximation algorithm with Markov chain Monte-Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences of the United States of America, 95, 7270–7274.
Haberman, S. J. (1977). Maximum likelihood estimates in exponential response models. The Annals of Statistics, 5, 815–841.
Haberman, S. J. (2004). Joint and conditional maximum likelihood estimation for the Rasch model for binary responses. ETS Research Report Series RR-04-20.
Hsiao, C. (2014). Analysis of panel data. Cambridge, UK: Cambridge University Press.
Ip, E. H. (2002). On single versus multiple imputation for a class of stochastic algorithms estimating maximum likelihood. Computational Statistics, 17, 517–524.
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23, 555–566.
Jennrich, R. I. (2002). A simple general method for oblique rotation. Psychometrika, 67, 7–19.
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three approaches. Multivariate Behavioral Research, 36, 347–387.
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
Lee, J. D., Sun, Y., & Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24, 1420–1443.
Lindstrøm, J. C., & Dahl, F. A. (2020). Model selection with Lasso in multi-group structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 27, 33–42.
Little, R. J., & Rubin, D. B. (1987). Statistical analysis with missing data. Hoboken, NJ: Wiley.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 44, 226–233.
Magis, D., Tuerlinckx, F., & De Boeck, P. (2015). Detection of differential item functioning using the Lasso approach. Journal of Educational and Behavioral Statistics, 40, 111–135.
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Boston, MA: Kluwer Academic Publishers.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Nielsen, S. F. (2000). The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli, 6, 457–489.
Parikh, N., & Boyd, S. (2014). Proximal algorithms. Foundations and Trends in Optimization, 1, 127–239.
Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30, 838–855.
Rabe-Hesketh, S., & Skrondal, A. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. New York, NY: Chapman and Hall/CRC.
Reckase, M. (2009). Multidimensional item response theory. New York, NY: Springer.
Robbins, H. (1956). An empirical Bayes approach to statistics. In J. Neyman (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (pp. 157–163). Berkeley, CA: University of California Press.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22, 400–407.
Rockafellar, R. T., & Wets, R. J. Variational analysis. New York, NY: Springer.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
Ruppert, D. (1988). Efficient estimators from a slowly convergent Robbins–Monro procedure (Technical Report No. 781). Ithaca, NY: School of Operations Research and Industrial Engineering, Cornell University.
Schmidt, M., Roux, N. L., & Bach, F. R. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24 (pp. 1458–1466). Red Hook, NY: Curran Associates.
Spall, J. C. (2003). Introduction to stochastic search and optimization. Hoboken, NJ: Wiley.
Sun, J., Chen, Y., Liu, J., Ying, Z., & Xin, T. (2016). Latent variable selection for multidimensional item response theory models via $L_1$ regularization. Psychometrika, 81, 921–939.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267–288.
Turlach, B., Weingessel, A., & Moler, C. (2019). quadprog: Functions to solve quadratic programming problems [Computer software manual]. R package version 1.5-8.
Tutz, G., & Schauberger, G. (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80, 21–43.
Van De Schoot, R., Hoijtink, H., & Deković, M. (2010). Testing inequality constrained hypotheses in SEM models. Structural Equation Modeling, 17, 443–463.
Vasdekis, V. G., Cagnone, S., & Moustaki, I. (2012). A composite likelihood inference in latent variable models for ordinal longitudinal responses. Psychometrika, 77, 425–441.
von Davier, M., & Lee, Y. S. (2019). Handbook of diagnostic classification models. New York, NY: Springer.
Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11, 95–103.
Xu, G. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45, 675–707.
Zhang, C. H. (2003). Compound decision theory and empirical Bayes methods. The Annals of Statistics, 31, 379–390.
Zhang, H., Chen, Y., & Li, X. (2020). A note on exploratory item factor analysis by singular value decomposition. Psychometrika, 85, 358–372.
Zhang, S., Chen, Y., & Liu, Y. (2020). An improved stochastic EM algorithm for large-scale full-information item factor analysis. British Journal of Mathematical and Statistical Psychology, 73, 44–71.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.
Figures and Tables

Table 1. Comparison of five stochastic algorithms.
Figure 1. The boxplot of mean squared errors for estimated parameters from the five methods.
Figure 2. The boxplot of mean squared errors for estimated parameters from the 'USP,' 'USP-RM1,' and 'StEM' methods.
Table 2. The elapsed time (seconds) for the five methods in confirmatory IFA.
Table 3. The sparse loading structure in the data generation IFA model.
Table 4. The mean squared errors for estimated loading parameters in exploratory IFA with $L_1$ regularization.
Table 5. The elapsed time (seconds) for exploratory IFA with $L_1$ regularization.
Table 6. The design matrix $\mathbf{Q}$ for the restricted LCA model.
Table 7. The MSE for item parameters $\theta_{j,\boldsymbol{\alpha}}$ in the restricted latent class model.
Table 8. The MSE for structural parameters $\nu_{\boldsymbol{\alpha}}$ in the restricted latent class model.
