Introduction
Full-information item factor analysis (IFA), that is, factor analysis of ordered categorical (such as binary) item-level data, has been a useful tool for exploring the latent structure underlying educational and psychological tests (Bock, Gibbons, & Muraki, 1988). IFA provides a wealth of information regarding the characteristics of items and tests, which is important for ensuring the reliability and validity of a measure. Because IFA deals with item-level responses, it is also known as multidimensional item response theory (MIRT) (Embretson & Reise, 2000; Reckase, 2009).
The widely used multidimensional two-parameter logistic (M2PL) model specifies the item response function of the ith individual to the jth item as
$$P(Y_{ij}=1\mid \boldsymbol{\theta}_i)=\frac{\exp(\boldsymbol{\alpha}_j^{\top}\boldsymbol{\theta}_i-b_j)}{1+\exp(\boldsymbol{\alpha}_j^{\top}\boldsymbol{\theta}_i-b_j)},\quad (1)$$
where there are N subjects who respond to J items independently, with binary response variables $Y_{ij}$ for $i=1,\dots,N$ and $j=1,\dots,J$. Here $\boldsymbol{\alpha}_j$ denotes the K-dimensional vector of item discrimination parameters for the jth item, and $b_j$ denotes the corresponding item difficulty parameter. $\boldsymbol{\theta}_i$ denotes the K-dimensional vector of latent abilities for student i. $\boldsymbol{\alpha}_j$ may contain structural 0's, implying that item j does not measure (hence does not load on) certain factors. When both $\boldsymbol{\alpha}_j$ and $\boldsymbol{\theta}_i$ are unidimensional, the 2PL model and the one-factor categorical factor analysis model are mathematically equivalent (Takane & De Leeuw, 1987; Wirth & Edwards, 2007). Another popular MIRT model, often suitable for multiple-choice binary response items, is the multidimensional three-parameter logistic (M3PL) model. It includes an additional parameter $c_j$ that quantifies the guessing probability on the jth item. Hence, the item response function is expressed as
$$P(Y_{ij}=1\mid \boldsymbol{\theta}_i)=c_j+(1-c_j)\,\frac{\exp(\boldsymbol{\alpha}_j^{\top}\boldsymbol{\theta}_i-b_j)}{1+\exp(\boldsymbol{\alpha}_j^{\top}\boldsymbol{\theta}_i-b_j)}.\quad (2)$$
Although the inclusion of the guessing parameter makes the model more flexible, the model no longer belongs to the exponential family, and its estimation becomes much more challenging (Thissen & Wainer, 1982; Yen, 1987).
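To make the two response functions concrete, here is a minimal sketch in Python that evaluates them under the parameterization above (the function names are ours, not from any package):

```python
import numpy as np

def m2pl_prob(alpha_j, theta_i, b_j):
    """M2PL item response probability P(Y_ij = 1 | theta_i)."""
    x = np.dot(alpha_j, theta_i) - b_j            # linear predictor
    return 1.0 / (1.0 + np.exp(-x))               # logistic link

def m3pl_prob(alpha_j, theta_i, b_j, c_j):
    """M3PL adds a lower asymptote c_j, the guessing probability."""
    return c_j + (1.0 - c_j) * m2pl_prob(alpha_j, theta_i, b_j)

# Example: a K = 2 item loading only on the first factor (structural 0 in alpha_j)
alpha_j = np.array([1.2, 0.0])
theta_i = np.array([0.5, -0.3])
print(m2pl_prob(alpha_j, theta_i, b_j=0.4))           # ~0.55
print(m3pl_prob(alpha_j, theta_i, b_j=0.4, c_j=0.2))  # ~0.64
```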
In an exploratory IFA, the item factor loading structure, reflected by the structural 0's in $\boldsymbol{\alpha}_j$, is unknown. Identifying the loading structure, which is equivalent to identifying the sparsity structure of $\boldsymbol{\alpha}_j$, is crucial not only for accurate calibration of item parameters and recovery of individual latent traits, but also for understanding the construct validity of a measure. Traditional approaches for identifying the item factor loading structure proceed in two steps: (1) allowing all item factor loadings to be freely estimated, subject to identifiability constraints; and (2) conducting a post hoc rotation (Browne, 2001). Most software packages use varimax (Kaiser, 1958) for orthogonal rotation or promax (Hendrickson & White, 1966) for oblique rotation by default. Other popular methods include, for instance, the CF-Quartimax rotation (Browne, 2001). While these rotation methods aim to produce a near-simple structure, an arbitrary cutoff for the rotated factor loadings is often needed. Rotation methods that encourage sparse solutions have also been developed in Jennrich (2004, 2006) using component loss functions for orthogonal and oblique rotations.
To avoid setting subjective cutoffs, Sun, Chen, Liu, Ying, and Xin (2016) recently proposed formulating the estimation of the loading structure in MIRT as a latent variable selection problem. Specifically, for each item, the set of latent traits influencing the distribution of the responses is selected by $L_1$-regularized regression. The $L_1$-regularized regression, also known as the least absolute shrinkage and selection operator (Lasso) (Tibshirani, 1996), has received much attention for solving variable selection problems in both linear and generalized linear models (Friedman, Hastie, & Tibshirani, 2010). The principal idea is to penalize the factor loadings toward zero when the corresponding latent traits are not associated with an item. This yields a direct estimate of the nonzero factor loading structure without subjective cutoffs. This approach also has a computational advantage over information-criterion-based model selection methods because it simultaneously estimates the loading structure and the model parameters. Despite its appeal, the computation remains quite challenging in MIRT models due to the intractable marginal likelihood function, which involves high-dimensional integration. For parameter estimation, Sun et al. (2016) used direct numerical approximation of the likelihood in an iterative expectation-maximization (EM) procedure, which can be computationally inefficient, especially in higher dimensions. Specifically, they showed that the computation time for latent variable selection with dimension $K=3$ is about 30 minutes for the first penalization tuning parameter $\lambda$ and an additional 10 minutes for each subsequent $\lambda$.
Considering that multiple $\lambda$s have to be used for latent variable selection via regularization, it can take a few hours to estimate the test structure for a single high-dimensional dataset.
Indeed, developing efficient algorithms for MIRT parameter estimation has long been a productive research topic, and a number of methods have been proposed to deal with the computational challenge (Rabe-Hesketh, Skrondal, & Pickles, 2005; von Davier & Sinharay, 2010). The first is the adaptive Gaussian quadrature method. Although the number of quadrature points per dimension can be small, the total number of quadrature points still increases exponentially with the number of dimensions. Moreover, an extra step is needed to compute the posterior mode and variance of the latent factors in each iteration, which adds additional computational cost (Pinheiro & Bates, 1995). The second is the family of Monte Carlo techniques, which includes, for instance, the Monte Carlo EM algorithm (McCulloch, 1997; C. Wang & Xu, 2015), the stochastic EM algorithm (von Davier & Sinharay, 2010; S. Zhang, Chen, & Liu, 2020), and the Metropolis-Hastings Robbins-Monro algorithm (Cai, 2010a, 2010b). These methods circumvent intractable integrations by sampling from the posterior distributions; however, they may still be computationally intensive for complicated high-dimensional models, as a large Monte Carlo sample size is typically needed and the posterior distributions usually do not have a closed form. Fully Bayesian estimation methods, such as Markov chain Monte Carlo (MCMC) (Albert, 1992; Patz & Junker, 1999), are equally computationally intensive, even though they are preferable with smaller sample sizes; they usually need a long chain to converge for complex models. In addition, Chen, Li, and Zhang (2019) and H. Zhang, Chen, and Li (2020) studied joint maximum likelihood estimation, treating the latent abilities as fixed-effect parameters instead of random variables; though computationally efficient, such joint-likelihood-based approaches may be less statistically efficient than marginal likelihood estimation (e.g., Cho, Wang, Zhang, & Xu, 2021).
Most recently, a variational approximation to the marginal likelihood was proposed, namely the Gaussian Variational EM (GVEM) algorithm (Cho et al., 2021). GVEM adopts a variational lower bound of the intractable likelihood within the EM framework. The carefully constructed variational lower bound allows one to derive closed-form updates for all model parameters in the iterative EM steps, making the algorithm computationally efficient. Cho et al. (2021) also proposed a stochastic version of GVEM to further improve computational efficiency when both the number of subjects, N, and the number of test items, J, are large. The idea is to stochastically optimize the variational approximation in the E-step, i.e., to subsample data to form a noisy estimate of the variational lower bound and iteratively update the estimate with a decreasing step size (Hoffman, Blei, Wang, & Paisley, 2013). The combined advantage of simple closed-form updates and stochastic optimization makes the GVEM algorithm appealing for high-dimensional MIRT models. Additionally, GVEM was shown to work well in complex M3PL models compared to existing methods.
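The subsample-and-step idea can be sketched generically as follows; this is a schematic of a stochastic update with a Robbins-Monro step size in the style of Hoffman et al. (2013), not the authors' exact update rules:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_update(param, data, n_iters=200, batch_size=32, tau=1.0, kappa=0.75):
    """Blend a minibatch estimate into the running value with a
    decaying step size rho_t = (t + tau)^(-kappa)."""
    N = len(data)
    for t in range(1, n_iters + 1):
        batch = data[rng.choice(N, size=batch_size, replace=False)]
        noisy_stat = batch.mean(axis=0)   # noisy stand-in for a full-data E-step quantity
        rho_t = (t + tau) ** (-kappa)     # decreasing step size
        param = (1 - rho_t) * param + rho_t * noisy_stat
    return param

data = rng.normal(loc=2.0, size=(1000, 1))
print(stochastic_update(np.zeros(1), data))   # approaches the full-data value, 2.0
```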
In this paper, we propose to extend the GVEM algorithm by adding a regularization penalty so as to simultaneously estimate the item factor loading structure and the model parameters. Our study differs from that of Sun et al. (2016) in the following aspects: (1) we use GVEM as the estimation algorithm instead of a quadrature-based EM algorithm, so the new method is better suited to the high-dimensional challenge; (2) we consider both the Lasso and the adaptive Lasso (Zou, 2006), the latter of which produces more accurate loading structure recovery; and (3) we apply the new method to both the M2PL and M3PL models.
The rest of the paper is organized as follows. Section 2 briefly introduces the GVEM algorithm for the MIRT models. Section 3 presents the general regularized variational algorithm. Sections 4 and 5 illustrate the performance of the proposed methods with simulation studies and real data analysis, respectively. Section 6 discusses potential future studies, and the supplementary material includes the derivations of the estimation procedures and additional data analysis results.
Variational Estimation for MIRT
In this section, we briefly present the key idea of the variational approximation discussed in Cho et al. (2021). The exposition is based on the M3PL model, but it can easily be simplified to the M2PL model. For conciseness, denote the model parameters of the MIRT models by $\mathbf{A}=\{\boldsymbol{\alpha}_j,\ j=1,\dots,J\}$, $\mathbf{B}=\{b_j,\ j=1,\dots,J\}$, and $\mathbf{C}=\{c_j,\ j=1,\dots,J\}$. Also denote the responses by $\mathbf{Y}=\{Y_i,\ i=1,\dots,N\}$, where $Y_i=\{Y_{ij},\ j=1,\dots,J\}$ is the ith subject's response vector. Due to the usual local independence assumption in IRT, the log-marginal likelihood of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ in the M3PL model given the responses $\mathbf{Y}$ is
$$l(\mathbf{A},\mathbf{B},\mathbf{C};\mathbf{Y})=\sum_{i=1}^{N}\log\int_{\boldsymbol{\theta}_i}\prod_{j=1}^{J}P(Y_{ij}\mid \boldsymbol{\theta}_i,\boldsymbol{\alpha}_j,b_j,c_j)\,\phi(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i,\quad (3)$$
where N is the total number of respondents and J is the total number of items in the test. The same holds for the M2PL model with model parameters $\mathbf{A}$ and $\mathbf{B}$. Here, $\phi$ denotes the K-dimensional Gaussian density of $\boldsymbol{\theta}$ with mean $\boldsymbol{0}$ and covariance $\Sigma_{\boldsymbol{\theta}}$. The maximum likelihood estimators of the model parameters are then obtained by maximizing the marginal likelihood function, which is often intractable under MIRT.
From here onwards, $M_p$ denotes all model parameters, for simplicity. Following Cho et al. (2021), the variational approximation of (3) can be derived as follows. First, for any arbitrary probability density function $q_i(\cdot)$, we can rewrite the log-marginal likelihood in (3) as
$$l(M_p;\mathbf{Y})=\sum_{i=1}^{N}\int_{\boldsymbol{\theta}_i}\log\frac{P(Y_i,\boldsymbol{\theta}_i\mid M_p)}{q_i(\boldsymbol{\theta}_i)}\,q_i(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i+\sum_{i=1}^{N}KL\{q_i(\boldsymbol{\theta}_i)\,\Vert\,P(\boldsymbol{\theta}_i\mid Y_i,M_p)\},$$
where $KL\{q_i(\boldsymbol{\theta}_i)\,\Vert\,P(\boldsymbol{\theta}_i\mid Y_i,M_p)\}=\int_{\boldsymbol{\theta}_i}\log\frac{q_i(\boldsymbol{\theta}_i)}{P(\boldsymbol{\theta}_i\mid Y_i,M_p)}\,q_i(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i$ denotes the Kullback-Leibler (KL) divergence between the distributions $q_i(\boldsymbol{\theta}_i)$ and $P(\boldsymbol{\theta}_i\mid Y_i,M_p)$. Then, since $KL\{q_i(\boldsymbol{\theta}_i)\,\Vert\,P(\boldsymbol{\theta}_i\mid Y_i,M_p)\}\ge 0$, we have a lower bound on the marginal likelihood:
$$l(M_p;\mathbf{Y})\ \ge\ \sum_{i=1}^{N}\int_{\boldsymbol{\theta}_i}\log\frac{P(Y_i,\boldsymbol{\theta}_i\mid M_p)}{q_i(\boldsymbol{\theta}_i)}\,q_i(\boldsymbol{\theta}_i)\,d\boldsymbol{\theta}_i.\quad (4)$$
Note that the equality in (4) holds if and only if $q_i(\boldsymbol{\theta}_i)=P(\boldsymbol{\theta}_i\mid Y_i,M_p)$ for $i=1,\dots,N$. Thus, to use the lower bound in (4) to approximate the marginal likelihood $l(M_p;\mathbf{Y})$, the posterior distribution $P(\boldsymbol{\theta}_i\mid Y_i,M_p)$ gives the best choice of the variational distribution $q_i(\boldsymbol{\theta}_i)$. However, such a choice of $q_i(\boldsymbol{\theta}_i)$ is not practically applicable, as the posterior distribution $P(\boldsymbol{\theta}_i\mid Y_i,M_p)$ is unknown. Alternatively, we can choose $q_i(\boldsymbol{\theta}_i)$ to be a tractable approximation of $P(\boldsymbol{\theta}_i\mid Y_i,M_p)$.
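The decomposition above can be checked numerically in a toy one-dimensional setting: for any choice of $q$, the integral term plus the KL term recovers the exact log-marginal likelihood, and the integral term alone is a lower bound. A minimal sketch with a single subject and a single 2PL item (all numbers illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

# toy model: theta ~ N(0,1), Y | theta ~ Bernoulli(sigmoid(a*theta - b)); observe Y = 1
a, b, y = 1.5, 0.3, 1
grid = np.linspace(-8.0, 8.0, 4001)
sig = 1.0 / (1.0 + np.exp(-(a * grid - b)))
joint = sig**y * (1 - sig)**(1 - y) * norm.pdf(grid)      # P(Y, theta)

log_marg = np.log(trapezoid(joint, grid))                 # exact log P(Y) by quadrature

q = norm.pdf(grid, loc=0.4, scale=0.9)                    # an arbitrary Gaussian q(theta)
elbo = trapezoid(q * (np.log(joint) - np.log(q)), grid)   # lower-bound (first) term
post = joint / np.exp(log_marg)                           # posterior P(theta | Y)
kl = trapezoid(q * (np.log(q) - np.log(post)), grid)      # KL(q || posterior)

print(np.isclose(elbo + kl, log_marg))   # True: the decomposition is exact
print(elbo <= log_marg)                  # True: the first term is a lower bound
```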
One example is the EM algorithm, which can be viewed as choosing $q_i(\boldsymbol{\theta}_i)$ to be the estimated posterior $P(\boldsymbol{\theta}_i\mid Y_i,\hat{M}_p)$, with $\hat{M}_p$ the estimate from the previous EM step. However, in the MIRT model, the expectation in the E-step with respect to the posterior distribution of $\boldsymbol{\theta}_i$, i.e., the first term in (4) with $q_i(\boldsymbol{\theta}_i)$ being the estimated posterior $P(\boldsymbol{\theta}_i\mid Y_i,\hat{M}_p)$, does not have an explicit form and is often challenging to compute.
Different from the EM algorithm, the variational inference method uses alternative choices of the $q_i(\boldsymbol{\theta}_i)$'s to obtain a computationally more efficient estimate of the lower bound in (4). Since the posterior distribution $P(\boldsymbol{\theta}_i\mid Y_i,M_p)$ of the MIRT model can be well approximated by a Gaussian distribution as the number of items J increases, following Cho et al. (2021), we choose $q_i(\boldsymbol{\theta}_i)$ from the family of Gaussian distributions and estimate the model parameters with the GVEM algorithm. In particular, in the E-step, $q_i$ is estimated within the Gaussian family to minimize the KL divergence between $q_i(\boldsymbol{\theta}_i)$ and $P(\boldsymbol{\theta}_i\mid Y_i,M_p)$, and we then evaluate the expectation of the likelihood lower bound with respect to the estimated $q_i(\boldsymbol{\theta}_i)$. In the M-step, this expectation is maximized to update all model parameters. Carefully chosen $q_i$'s yield closed-form updates for all model parameters (Cho et al., 2021), making the algorithm computationally efficient.
Regularized Estimation of Loading Structure
In this paper, our main interest is to estimate a sparse loading structure, denoted as $Q_A=(q_{jk})$, where $q_{jk}=I(\alpha_{jk}\ne 0)$. Similar to Sun et al. (2016), we cast the problem of sparsity estimation as a latent variable selection problem and solve it using regularized regression with an $L_1$-type penalty. One main contribution is to apply the variational approach to avoid directly calculating the intractable marginal likelihood while solving the regularization problem.
Although Lasso regularization is a popular technique for simultaneous model estimation and efficient variable selection, there have been arguments against the oracle property of the Lasso. For instance, Zou (2006) argued that nontrivial conditions are needed for Lasso variable selection to be consistent, and thus the Lasso rarely enjoys the oracle properties. Although the computational efficiency of the Lasso is appealing for estimation problems in high-dimensional MIRT models, the bias of the Lasso may prevent consistent variable selection and model estimation. On the other hand, the adaptive Lasso enjoys the oracle properties when the regularization parameters are chosen to be data-dependent (Zou, 2006). Since it poses a convex optimization problem, its global optimizer can be computed efficiently. Additionally, the adaptive Lasso is a simple extension of the Lasso, which makes it easy to implement with existing Lasso algorithms and computationally efficient as well. Hence, the adaptive Lasso is a good candidate penalization method for identifying the item factor loading structure in MIRT. Specifically, for parameter estimation we solve the following optimization problem:
where
with $\hat{w}_{jk}=1/|\hat{\alpha}^{(0)}_{jk}|^{\gamma}$, where $\hat{\alpha}^{(0)}_{jk}$ is an initial estimator of $\alpha_{jk}$ obtained without the regularization penalty, and $\gamma>0$ and $\lambda>0$ are tuning parameters. In the adaptive Lasso penalization, we use an adaptive penalization weight for each parameter $\alpha_{jk}$, instead of a constant penalization parameter $\lambda$ as in the Lasso. The penalization weight for $\alpha_{jk}$ is $\lambda\hat{w}_{jk}=\lambda/|\hat{\alpha}^{(0)}_{jk}|^{\gamma}$. Thus, a small initial estimate such as $|\hat{\alpha}^{(0)}_{jk}|<1$ is penalized more heavily than larger values such as $|\hat{\alpha}^{(0)}_{jk}|>1$. The weights are chosen to be data-dependent so as to satisfy the regularity conditions discussed in Zou (2006).
In particular, Zou (2006) recommended three values, 0.5, 1, and 2, for the $\gamma$ parameter; the selection of the $\lambda$ parameter will be discussed in Sect. 3.2.
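For intuition, the effect of the adaptive weights can be seen in the coordinate-wise soft-thresholding operator that underlies $L_1$-type updates. A small generic sketch (not the GVEM M-step itself; all numbers illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0), the proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

alpha_init = np.array([1.8, 0.9, 0.15])   # initial unpenalized loading estimates
gamma, lam = 1.0, 0.05
w = 1.0 / np.abs(alpha_init)**gamma       # adaptive weights: small loadings get large weights

z = np.array([1.7, 0.8, 0.1])             # hypothetical unpenalized coordinate updates
print(soft_threshold(z, lam))             # plain Lasso: all three loadings survive
print(soft_threshold(z, lam * w))         # adaptive Lasso: the weak loading is set to 0
```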
To ensure identifiability, we impose certain constraints on a $K\times K$ sub-matrix of $Q_A$. For the remaining part of the $\mathbf{A}$ matrix, we do not assume any pre-specified zero structure; instead, an appropriate penalty is imposed to shrink the $\alpha_{jk}$'s so as to recover the true zero structure, $Q^*_A$. Below are two different constraints on the $\mathbf{A}$ matrix; note that the second constraint is more flexible and hence more challenging estimation-wise. In addition to the constraints on $Q_A$, we also fix the diagonal of $\Sigma_{\boldsymbol{\theta}}$ at 1. Similar to Sun et al. (2016), we compare the performance of these two constraint settings in the simulation study.
Constraint 1. To ensure identifiability, we designate one item for each latent factor such that the item is associated with only that factor. That is, we set a $K\times K$ sub-matrix of $Q_A$ to be the identity matrix, $I_K$. Together with the constraints on the variances in $\Sigma_{\boldsymbol{\theta}}$, we have $K^2$ constraints in total.
Constraint 2. Instead of setting all off-diagonal entries of a $K\times K$ sub-matrix of $Q_A$ to zero, we only require this sub-matrix to be triangular with ones on the diagonal. That is, each factor has at least one item that is associated with it for sure, and those items may be associated with other factors as well. The nonzero entries of $Q_A$, except for the diagonal entries of the sub-matrix, are penalized during the estimation procedure. Although this constraint is much weaker than Constraint 1, it still ensures empirical identifiability when a properly regularized likelihood such as (5) is used for model estimation (Sun et al., 2016).
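For concreteness, with $K=3$ the constrained sub-matrix of $Q_A$ takes the following forms under the two settings (an illustration; which items occupy these rows is a design choice, and asterisks mark entries that are freely estimated but penalized):
$$\text{Constraint 1: }\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix},\qquad\text{Constraint 2: }\begin{pmatrix}1&0&0\\ *&1&0\\ *&*&1\end{pmatrix}.$$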
Additional Penalty for M3PL
In practice, parameter estimation for the M3PL is often more challenging due to the inclusion of the guessing parameters. To tackle this challenge and improve the accuracy of parameter estimation in the M3PL, we propose to impose additional constraints on the model parameters $\mathbf{B}=\{b_j,\ j=1,\dots,J\}$ and $\mathbf{C}=\{c_j,\ j=1,\dots,J\}$, in addition to the parameter matrix $\mathbf{A}$. Specifically, for parameter estimation we solve the following optimization problem, where $P(\cdot)$ denotes a penalty function on the model parameters:
where $P(\mathbf{B})=\sum_{j=1}^{J}\log N(b_j\mid\mu_b,\sigma_b^2)$ and $P(\mathbf{C})=\sum_{j=1}^{J}\log \mathrm{Beta}(c_j\mid\alpha_c,\beta_c)$ for some distribution parameters $\mu_b$, $\sigma_b^2$, $\alpha_c$, and $\beta_c$. These penalty functions are chosen to respect the ranges of values on which the parameters are defined. For instance, since the guessing parameters $\mathbf{C}$ naturally satisfy the constraint $\{0<c_j<1;\ j=1,\dots,J\}$, we can assume a “prior” distribution $c_j\sim \mathrm{Beta}(\alpha_c,\beta_c)$. Similarly, we can assume a “prior” distribution $b_j\sim N(\mu_b,\sigma_b^2)$.
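A small sketch of how these two log-prior penalty terms can be evaluated (hyperparameter values are illustrative, not recommendations):

```python
import numpy as np
from scipy.stats import beta, norm

b = np.array([-0.5, 0.2, 1.1])     # difficulty parameters B
c = np.array([0.15, 0.25, 0.20])   # guessing parameters C

mu_b, sigma_b = 0.0, 2.0           # normal "prior" on b_j
alpha_c, beta_c = 2.0, 8.0         # Beta "prior" on c_j, mass near typical guessing rates

P_B = norm.logpdf(b, loc=mu_b, scale=sigma_b).sum()   # L2-type penalty on B
P_C = beta.logpdf(c, alpha_c, beta_c).sum()           # keeps c_j inside (0, 1)

# The penalized objective adds P_B and P_C to the variational lower bound
# from Sect. 2 (the bound itself is omitted in this sketch).
print(P_B, P_C)
```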
The penalties on $b_j$ and $c_j$ are essentially an $L_2$-type and a Laplace-type penalization, respectively. By imposing these additional penalties on the model parameters $\mathbf{B}$ and $\mathbf{C}$, the parameter estimation becomes more stable and robust.
The approach of imposing additional penalties on the model parameters $\mathbf{B}$ and $\mathbf{C}$ with the chosen distributions is similar to the Bayes modal estimation presented by Tierney and Kadane (1986). That is, an augmented optimization objective is employed that includes the likelihood and some prior beliefs about the item parameters. These priors can be used to prevent deviant parameter estimates and help the algorithm produce more accurate estimates in complex M3PL models. Essentially, Bayes modal estimation can be seen as a regularization of maximum likelihood estimation, where maximum likelihood estimation is the special case of Bayes modal estimation that assumes uniform prior distributions.
The amount of penalization can be flexibly controlled through the distribution parameters. For instance, one can use a non-informative prior on $\mathbf{C}$ such as $\mathrm{Beta}(1,1)$, which is equivalent to the flat uniform distribution on $[0,1]$. Similarly, one can choose a non-informative normal prior with a large variance $\sigma_b^2$ for $\mathbf{B}$. Thus, although additional penalty functions are added, the algorithm still allows flexible estimation with essentially no penalty through the choice of non-informative distributions. The advantage is that practitioners can adjust the amount of prior knowledge they wish to impose on the model: the less prior knowledge one uses, the more flexible the estimation is, and the more the results are driven by the observed data. With these prior-like penalties, our algorithm yields more precise parameter estimates for the M3PL model.
Computation via GVEM
This section introduces the main estimation algorithm for obtaining the estimates $(\hat{\mathbf{A}}_\lambda,\hat{\mathbf{B}}_\lambda,\hat{\mathbf{C}}_\lambda)$ via (6) using the GVEM algorithm. As introduced in Sect. 2, we use a variational lower bound to approximate the intractable marginal log-likelihood $l(\mathbf{A},\mathbf{B},\mathbf{C};\mathbf{Y})$ in (6).
To derive a lower bound that allows easy estimation of the M3PL parameters, instead of directly working with (4), we employ an equivalent representation of the M3PL model with an auxiliary latent variable $Z_{ij}$, an indicator of whether the ith individual answers the jth item based on latent ability or guesses it correctly (von Davier, 2009). Specifically, $Z_{ij}=1$ if the ith individual solves item j based on his/her ability, and $Z_{ij}=0$ if he/she guesses item j correctly. The distribution of $Y_{ij}$ given the latent variables $\boldsymbol{\theta}_i$ and $Z_{ij}$ is then
where we define $0^0=1$; it can be seen that this new model with the auxiliary variable Z is equivalent to the M3PL model (von Davier, 2009; Cho et al., 2021). Denote $\mathbf{Z}_i=\{Z_{i1},Z_{i2},\dots,Z_{iJ}\}$ and its distribution as $p(\mathbf{Z}_i)=\prod_{j=1}^{J}p(Z_{ij})$. Then the complete-data likelihood of the ith subject can be written as
where $\phi$ denotes the normal probability density function of the latent variable $\boldsymbol{\theta}$. Here, without loss of generality, we focus on the ith subject's likelihood function due to the independence across subjects.
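The equivalence of this augmented representation with the M3PL response function in (2) can be checked by simulation. A minimal sketch under our reading of the description above (Z_ij = 0, a correct guess, forces Y_ij = 1 and occurs with probability c_j; the exact display (7) is in the original):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_j, b_j, c_j = np.array([1.0, 0.8]), 0.5, 0.2
theta_i = np.array([0.3, -0.4])

sig = 1.0 / (1.0 + np.exp(-(alpha_j @ theta_i - b_j)))  # ability-based success probability
p_direct = c_j + (1 - c_j) * sig                        # M3PL probability from (2)

n = 1_000_000
Z = rng.random(n) < (1 - c_j)              # Z = 1: answer from ability (prob 1 - c_j)
Y = np.where(Z, rng.random(n) < sig, True) # Z = 0: correct guess, so Y = 1
print(p_direct, Y.mean())                  # the two agree up to Monte Carlo error
```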
With the above representation, for any variational distribution functions $q_i$ and $r_{ij}$ (to be estimated later) of the latent variables $\boldsymbol{\theta}_i$ and $Z_{ij}$, similar to the derivation in Sect. 2, we have the following variational lower bound, which generalizes (4),
where $r_i({\varvec{Z}}_i) = \prod_{j=1}^J r_{ij}(Z_{ij})$. Since (9) does not depend on the parameters ${\varvec{A}}$, ${\varvec{B}}$, and ${\varvec{C}}$, we focus on (8) for the derivation of the lower bound. For (8), note that $\log P(Y_{i}, {\varvec{\theta}}_i, {\varvec{Z}}_i \mid {\varvec{A}},{\varvec{B}},{\varvec{C}})$ takes the form of (7). To obtain a closed-form lower bound expression for (8), we further use a local variational method (Bishop, 2006; Jordan, Ghahramani, Jaakkola, & Saul, 1999). Particularly, define $\xi_{i,j}$ as a variational parameter indexed by i and j, and let $\eta(\xi_{i,j}) = (2\xi_{i,j})^{-1}\left[ e^{\xi_{i,j}}/(1+e^{\xi_{i,j}}) - 1/2 \right]$; this is the coefficient in the standard quadratic lower bound on the logistic function, $\sigma(x) \ge \sigma(\xi) \exp\{(x-\xi)/2 - \eta(\xi)(x^2 - \xi^2)\}$, which holds for all real x, with equality at $x = \pm \xi$. Let $\varvec{\xi}_i = (\xi_{i,j}, j=1,\dots,J)$ denote the ith subject's variational parameters for the J items. Then, following the local variational method (Bishop, 2006), we have
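$$\log P(Y_i, {\varvec{\theta}}_i, {\varvec{Z}}_i \mid {\varvec{A}},{\varvec{B}},{\varvec{C}}) \ge l({\varvec{A}}, {\varvec{B}}, {\varvec{C}}, \varvec{\xi}_i;\, Y_i, {\varvec{\theta}}_i, {\varvec{Z}}_i),$$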
where $l({\varvec{A}}, {\varvec{B}}, {\varvec{C}}, \varvec{\xi}_i; Y_i, {\varvec{\theta}}_i, {\varvec{Z}}_{i})$ is defined as
and it gives a lower bound of $\log P(Y_i, {\varvec{\theta}}_i, {\varvec{Z}}_{i} \mid {\varvec{A}}, {\varvec{B}}, {\varvec{C}})$ in (8). We then have the following expression for the variational lower bound of the marginal likelihood of all observed responses in (6),
with the lower bound $E({\varvec{A}}, {\varvec{B}}, {\varvec{C}}, \varvec{\xi})$ defined as
Appropriate choices of the variational distributions will lead to a closed-form expression of the lower bound in (11). Particularly, following the derivations in Cho et al. (2021), the above likelihood function implies that an optimal choice of $q_i$ is $q_i({\varvec{\theta}}_i) \sim N({\varvec{\theta}}_i \mid \mu_i, \Sigma_i)$, where the mean and covariance are
and the variational distributions $r_{ij}(Z_{ij})$ are $r_{ij}(Z_{ij}) \sim \mathrm{Bernoulli}(s_{ij})$, where $s_{ij} = 1$ if $Y_{ij} = 0$, and otherwise
With the above chosen $q_i$'s and $r_{ij}$'s, we aim to estimate the model parameters ${\varvec{A}}$, ${\varvec{B}}$, and ${\varvec{C}}$, together with the introduced local variational parameters $\varvec{\xi}$, by maximizing the variational lower bound of the marginal likelihood, $E({\varvec{A}}, {\varvec{B}}, {\varvec{C}}, \varvec{\xi})$ in (11), with the proposed penalties in (6), that is,
The corresponding solution $({\hat{{\mathbf{A}}}}_{\lambda}, {\hat{{\mathbf{B}}}}_{\lambda}, {\hat{{\mathbf{C}}}}_{\lambda})$ gives our GVEM estimators for the penalized likelihood in (6).
To estimate $({\varvec{A}}, {\mathbf{B}}, {\mathbf{C}})$, we use the coordinate descent algorithm (Friedman, Hastie, Höfling, & Tibshirani, 2007; Friedman et al., 2010), which solves the target optimization problem by successively minimizing along each coordinate direction of $({\varvec{A}}, {\mathbf{B}}, {\mathbf{C}})$. For each item j, there are one difficulty parameter $b_j$, one guessing parameter $c_j$, and K discrimination parameters ${\varvec{\alpha}}_j$. The coordinate descent algorithm updates each of the $K+2$ variables according to the following updating rule (see the Appendix for a detailed derivation). Note that the soft-thresholding update rule for $a_{jk}$ below can be viewed as arising from the proximal gradient descent algorithm (Beck & Teboulle, 2009). Define a function S to be a soft-threshold operator such that
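$$S(x, \lambda) = \mathrm{sign}(x)\, (|x| - \lambda)_{+},$$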
where for any real number x, $\mathrm{sign}(x)$ denotes the sign of x and $x_+$ denotes $\max\{0, x\}$. The model parameters ${\varvec{\alpha}}_j$, $b_j$, and $c_j$ are updated using Equations (17), (18), and (19), respectively,
where ${\hat{\alpha}}^{(0)}_{jk}$ is the initial estimator of $\alpha_{jk}$ from the GVEM algorithm without the penalty terms in (15). Additionally, the variational parameters $\xi_{i,j}$ are updated as
and the covariance can be updated as
To choose the sparsity parameter $\lambda$, we can apply popular information criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the generalized information criterion (GIC) (Nishii, 1984; Y. Fan & Tang, 2013). We estimate the information criteria by substituting the variational lower bound from the GVEM algorithm for the log-likelihood, and the sparsity parameter that minimizes the criterion is considered optimal. Our pilot study shows that the GIC proposed for high-dimensional model selection in Y. Fan and Tang (2013) performs better than AIC and BIC, and hence GIC is used throughout the study.
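This selection loop is simple to implement. The sketch below assumes a hypothetical wrapper `fit_gvem(lam)` around the regularized GVEM fit that returns the maximized lower bound and the number of nonzero parameters; the complexity weight `c_n` is left generic, since AIC-, BIC-, and GIC-type rules differ only in this term.

```python
import numpy as np

def info_criterion(lower_bound, k, c_n):
    # -2 * (variational lower bound, used in place of the log-likelihood)
    # plus a complexity penalty: c_n = 2 gives an AIC-type rule,
    # c_n = log(N) a BIC-type rule, and the heavier weight of
    # Y. Fan and Tang (2013) gives GIC.
    return -2.0 * lower_bound + c_n * k

def select_lambda(fit_gvem, lambdas, c_n):
    # fit_gvem(lam) -> (maximized lower bound E, number of nonzero parameters)
    scores = [info_criterion(*fit_gvem(lam), c_n) for lam in lambdas]
    return lambdas[int(np.argmin(scores))]
```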
The detailed algorithm of the regularized estimation of the loading structure via adaptive Lasso penalization is illustrated in Algorithm 1.
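As a minimal illustration of the thresholding machinery inside Algorithm 1, the operator S and the local weight $\eta$ can be coded directly. The exact closed-form quantities of Equations (17)-(19) are abstracted here into a generic gradient and step size, so the coordinate update below is a schematic proximal step rather than the paper's exact rule.

```python
import numpy as np

def soft_threshold(x, lam):
    # S(x, lam) = sign(x) * (|x| - lam)_+
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def eta(xi):
    # local variational weight eta(xi) defined in the text
    return (np.exp(xi) / (1.0 + np.exp(xi)) - 0.5) / (2.0 * xi)

def proximal_update_ajk(a_jk, grad_jk, step, lam, w_jk):
    # Gradient step on the smooth part of the objective, followed by
    # soft-thresholding with the (adaptive) penalty weight w_jk.
    return soft_threshold(a_jk - step * grad_jk, step * lam * w_jk)
```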
Remark 1
In addition to our choice of adaptive Lasso for $P_\lambda({\mathbf{A}})$ in (6), other penalties are available. For instance, J. Fan and Li (2001) showed that Lasso penalization is suboptimal relative to their proposed smoothly clipped absolute deviation (SCAD) penalty, as Lasso produces biased estimates for large coefficients; SCAD, in contrast, enjoys asymptotic normality and oracle properties with a proper choice of regularization parameters. Due to these solid theoretical properties, SCAD has been widely applied in variable selection problems (T. Wang, Xu, & Zhu, 2012; Liu, Yao, & Li, 2016; Breheny & Huang, 2011). Additionally, the minimax concave penalty (MCP) has been presented as a fast, continuous, and nearly unbiased penalization method and hence claimed to be a good alternative to Lasso (C. H. Zhang, 2010). Truncated Lasso is another popular penalization method (Shen, Pan, & Zhu, 2012; Xu & Shang, 2018); however, the penalty functions of these methods are non-convex, so local solutions are in general nonunique and computationally challenging to obtain. On the other hand, adaptive Lasso uses a convex penalty and is computationally efficient, which makes it a good candidate for regularized estimation under complex MIRT models. Hence, we choose adaptive Lasso for our regularized problem.
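For concreteness, with weights built from the unpenalized initial GVEM estimates ${\hat{\alpha}}^{(0)}_{jk}$ and the exponent $\gamma$ (cf. Zou, 2006), the adaptive Lasso penalty takes the form

$$P_\lambda({\mathbf{A}}) = \lambda \sum_{j,k} \frac{|a_{jk}|}{\big|{\hat{\alpha}}^{(0)}_{jk}\big|^{\gamma}},$$

with the sum running over the penalized (unconstrained) entries of ${\mathbf{A}}$.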
Simulation Study
Design
A simulation study was conducted to evaluate the performance of the regularized GVEM algorithm in identifying the true item factor loading structure under both M2PL and M3PL models. Three manipulated factors were considered: (1) the number of dimensions, fixed at 3 or 5 (i.e., $K = 3, 5$); (2) the correlations among factors, fixed at 0.1, 0.3, or 0.7; and (3) both between-item and within-item multidimensional structures. The sample size was fixed at 2000 (i.e., $N = 2000$), and 100 replications were run.
For the between-item MIRT model, the test length was 45, with 15 items loading onto each factor. The true item parameters were selected from the 2013 NAEP item bank (combined national and state assessments) for grade 8. For the within-item MIRT model, the true item discrimination parameters were simulated from Unif(0.75, 2), and the difficulty parameters were drawn from the standard normal distribution. Additionally, in M3PL, the guessing parameters were fixed at 0.2. The generated item parameters closely resemble those in Table 6.1 of Reckase (2009). When the dimension was 3, about 60% of the items loaded onto one factor, about 25% loaded onto two factors, and the rest loaded onto all three factors, whereas for the 5-dimension conditions, about 60%, 20%, and 20% of the items loaded onto one, two, and three factors, respectively. In all cases, the latent traits ${\varvec{\theta}}$ were simulated from $MVN(0, \Sigma_{\varvec{\theta}})$ with unit variances, where the common factor correlation $r = 0.1$, 0.3, or 0.7.
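As an illustration of this design, a minimal NumPy sketch for generating within-item M3PL data is given below. The random loading mask is illustrative only (it stands in for the exact 60/25/15% loading pattern), the test length of 45 is carried over from the between-item design, and a sign convention on $b_j$ is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, K, r, c = 2000, 45, 3, 0.3, 0.2   # sample size, items, factors, correlation, guessing

# latent traits: multivariate normal with unit variances and common correlation r
Sigma = np.full((K, K), r)
np.fill_diagonal(Sigma, 1.0)
theta = rng.multivariate_normal(np.zeros(K), Sigma, size=N)

# within-item structure: discriminations from Unif(0.75, 2) on active entries,
# difficulties from N(0, 1); the loading mask below is illustrative only
mask = rng.random((J, K)) < 0.5
mask[np.arange(J), rng.integers(0, K, J)] = True    # ensure each item loads somewhere
A = rng.uniform(0.75, 2.0, size=(J, K)) * mask
b = rng.standard_normal(J)

logit = theta @ A.T - b                  # sign convention on b assumed
p2pl = 1.0 / (1.0 + np.exp(-logit))
p3pl = c + (1.0 - c) * p2pl              # M3PL with fixed guessing c
Y = (rng.random((N, J)) < p3pl).astype(int)
```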
Six methods were compared in the study: (1) traditional exploratory item factor analysis followed by the CF-Quartimax rotation, denoted "Rotation" in all results. For this method, we did not assume any constraint on the item discrimination parameters during estimation but fixed the population covariance matrix to an identity matrix, i.e., $\varvec{\Sigma}_{\varvec{\theta}} = {\varvec{I}}$. The GVEM algorithm was used for model estimation. The final discrimination parameters were transformed to standardized factor loadings via $\varvec{\psi}_j = {\varvec{U}}^{-1}\varvec{\alpha}_j$, and the loadings were compared to the cutoff 0.3 (Henson & Roberts, 2006; Costello & Osborne, 2005): if $|\psi_{jk}|$ exceeds 0.3, the item is assumed to load on the corresponding factor. This transformation worked for all simulated conditions except the within-item structure with $r=0.7$, $K=3$, under M2PL and M3PL. In these two conditions, transforming the true discrimination parameters to standardized factor loadings produced some values smaller than 0.3, so we set the cutoff to 0.75 instead, as the true values were generated from Unif(0.75, 2). Setting a different cutoff certainly affects the results, which, to some extent, reflects the subjectivity of the traditional EFA rotation method. (2) Exploratory item factor analysis with fixed anchors, denoted "Fixed Anchors" in all results. For this method, we imposed constraint 1 on $Q_A$ so that post-hoc rotation is no longer needed, and we used the same transformation formula to calculate standardized factor loadings. This method was included to ensure a direct and fair comparison to the regularization methods. (3) Lasso under constraint 1 and under constraint 2; and (4) adaptive Lasso under constraint 1 and under constraint 2. For the regularization methods, the tuning parameter $\lambda$ was chosen by GIC. The GIC was computed as follows:
where N refers to the sample size, k refers to the number of parameters estimated by the model, and $E({\varvec{A}}, {\varvec{B}}, {\varvec{C}}, \varvec{\xi})$ refers to the lower bound.
In addition to the two constraints for model identifiability, we truncated ${\hat{\alpha}}_{jk}$ to 0 if $|{\hat{\alpha}}_{jk}| < 0.001$. As to $\gamma$ in adaptive Lasso, Zou (2006) recommended three values: 0.5, 1, and 2. A few pilot trials were conducted to decide on the optimal $\gamma$; $\gamma = 2$ was used for all conditions except a few in which $\gamma = 1$ was used, namely within-item M2PL with $r=0.7$, $K=5$, under constraints 1 and 2, as well as between-item M2PL with $r=0.7$, $K=5$, under constraint 2 only.
As the main objective of this section is to estimate the relationship between test items and latent traits, we used the correct estimation rate of the ${\varvec{A}}$ matrix (Eq. (22)), which measures how well the sparsity of the ${\varvec{A}}$ matrix is recovered by the regularized estimation. Notice that we calculated the correct rate only for entries outside the first $K \times K$ sub-matrix, since this part was fixed, with an identity matrix as its zero structure, to ensure identifiability.
We also compared the performance of Lasso and adaptive Lasso penalization using two measures: sensitivity and specificity. In our context, sensitivity is the probability of correctly identifying nonzero entries among the true nonzero entries, and specificity is the probability of correctly identifying zero entries among the true zero entries. In other words, sensitivity measures the true positive rate, while specificity measures the true negative rate. Naturally, a test with both high sensitivity and high specificity is desired, although there is always a trade-off.
Other criteria include the average relative bias and root mean squared error (RMSE). The parameter recovery for $\Sigma_{\varvec{\theta}}$ is assessed by taking differences between each freely estimated entry of the true $\Sigma_{\varvec{\theta}}$ and the estimated ${\hat{\Sigma}}_{\varvec{\theta}}$. Relative bias and RMSE were first obtained for each nonzero model parameter across all items within a condition and then averaged over 100 replications.
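These evaluation criteria are straightforward to compute; a minimal NumPy sketch, assuming constraint 1's fixed upper-left $K \times K$ block, is:

```python
import numpy as np

def structure_metrics(A_hat, A_true, K):
    # Zero-pattern recovery, skipping the first K x K block fixed for identifiability.
    free = np.ones(A_true.shape, dtype=bool)
    free[:K, :K] = False
    est, tru = (A_hat != 0)[free], (A_true != 0)[free]
    correct_rate = np.mean(est == tru)
    sensitivity = np.mean(est[tru])       # true nonzeros estimated as nonzero
    specificity = np.mean(~est[~tru])     # true zeros estimated as zero
    return correct_rate, sensitivity, specificity

def recovery_metrics(est, true):
    # Average relative bias and RMSE over the true nonzero parameters.
    nz = true != 0
    rel_bias = np.mean((est[nz] - true[nz]) / true[nz])
    rmse = np.sqrt(np.mean((est[nz] - true[nz]) ** 2))
    return rel_bias, rmse
```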
Simulation Results
In this section, we first present the simulation results under various settings for M2PL and M3PL, with boxplots showing the distribution of correct estimation rates, sensitivities, and specificities. Among the three information criteria, GIC showed the best performance at selecting the optimal result, as it penalizes the number of parameters more heavily; thus, the figures in this section report the simulation results under the GIC selection criterion.
Figures 1 and 2 show the recovery of the item factor loading structure in terms of correct rates, sensitivity, and specificity under M2PL and M3PL, respectively. All six methods are presented in the same order under each manipulated condition in Figures 1-6. For M2PL, the adaptive Lasso method is consistently the best-performing method under all conditions, except when $r=0.1$, $K=5$ with a within-item M2PL structure and when $r=0.3$, $K=5$ with a between-item M2PL structure; under these two conditions, the EFA rotation method performs slightly better than the adaptive Lasso method. The EFA with fixed anchors and Lasso regularization methods, on the other hand, perform considerably worse. When $K=3$ and within-item M2PL is used, the EFA rotation method performs considerably worse than the EFA with fixed anchors and adaptive Lasso methods. Between the two constraint settings, constraint 2 yields more free parameters and is hence harder to handle than constraint 1. Therefore, it is not surprising that adaptive Lasso with constraint 1 performs slightly better than with constraint 2 in the more challenging scenarios (i.e., higher correlation, larger K, and within-item multidimensionality), whereas the difference between the two is almost negligible in simpler scenarios.
When M3PL is the data-generating model, the recovery of the item factor loading structure is generally worse than under M2PL, with correct rate, sensitivity, and specificity decreasing by roughly 5% to 20%. The general trends across the manipulated factors stay the same as for M2PL: increasing the factor correlation or allowing item cross-loadings makes the recovery of the factor structure harder, although adaptive Lasso still performs best among the six methods in all conditions except when $r=0.7$, $K=5$ and the test exhibits a between-item multidimensional structure, in which case the EFA rotation method tends to excel.
Figure 3 presents the relative bias of model parameters under M2PL. When the test has 5 latent factors and $r=0.1$ or 0.3, although the relative bias varies slightly across parameters, the results from the six methods are almost indistinguishable. In a between-item structure with 3 factors, the relative bias for b shows more variability across replications, because the true parameters of some items are close to 0. The relative bias varies more under the within-item conditions in general; there, the two regularization methods appear to produce less bias than the EFA rotation method, especially for $\Sigma_\theta$. Under the within-item M2PL conditions with $K=3$ and $r=0.1$ or 0.7, the relative bias values for $\Sigma_\theta$ estimated by EFA rotation fall outside the plotted range. Figure 4 presents the RMSE of model parameters under M2PL. Again, all six methods produce comparable RMSE when $r=0.1$ under the between-item condition. When the factor correlation increases, the EFA rotation method generates larger RMSE for $\alpha$ and $\Sigma_\theta$. The same trend holds under the within-item M2PL conditions, although the adaptive Lasso method seems to generate large RMSE in some conditions. Under the most difficult condition, $r=0.7$, $K=5$, there is considerable variability in RMSE across replications; in this case, the two Lasso methods seem to produce smaller median RMSE than the EFA rotation and EFA with fixed anchors methods for the majority of the parameters.
The better performance of the Lasso methods compared to adaptive Lasso may be due to two reasons: (1) we computed bias and RMSE only for parameters whose true values were nonzero, so even if the Lasso method fails to shrink some true zero loadings to zero, this failure does not count toward bias or RMSE; and (2) initial values play an important role in adaptive Lasso, as they determine the adaptive penalty weights. We used the results from the EFA rotation method as initial values; other, better initial values could be explored in the future, such as the SVD method in H. Zhang et al. (2020).
Figures 5 and 6 show the relative bias and RMSE of model parameters under M3PL. The inclusion of the guessing parameter, unsurprisingly, makes model parameter recovery much harder, as reflected in larger bias and RMSE as well as more variability across replications. The overall pattern observed in the M2PL results continues to hold: increasing the factor correlation and using a within-item factor structure not only increase relative bias and RMSE but also yield more instability across replications. The EFA rotation method produces the largest average absolute bias and mean RMSE in almost all conditions, followed by the EFA with fixed anchors method, although the results from the regularization methods show more variability when the factor correlation is high.
In summary, the adaptive Lasso method outperforms the EFA rotation method under almost all conditions regarding the recovery of the item factor loading structure. There are only three exceptions: within-item M2PL with $r=0.1$, $K=5$; between-item M2PL with $r=0.3$, $K=5$; and between-item M3PL with $r=0.7$, $K=5$. In these three conditions, the EFA rotation method performs better than the adaptive Lasso method by a small margin. Under some simple scenarios (i.e., between-item M2PL with low or medium correlation and $K=3$), there is no appreciable difference between the EFA rotation method and the adaptive Lasso method under either type of constraint. As for item parameter recovery, the adaptive Lasso method outperforms the EFA rotation method in all of the high-correlation M2PL scenarios, while for the low-correlation, between-item M2PL conditions, the results of adaptive Lasso and EFA rotation appear indistinguishable. In M3PL, the adaptive Lasso method produces more accurate results than the EFA rotation method under all conditions; only under between-item M3PL conditions does EFA rotation generate smaller RMSE values and relative bias with less variability for $\Sigma_\theta$, while producing larger RMSE and relative bias for the other parameters.
Real Data Analysis
In this section, the proposed regularization method was applied to the National Education Longitudinal Study of 1988 (NELS:88) data, and the results were compared with those from the EFA rotation method. NELS:88 followed a nationally representative sample of students whose performance on different cognitive batteries was tracked from 8th to 12th grade (the first three waves) in 1988, 1990, and 1992. In this study, we focused on the science and mathematics test data, for which the multidimensional factorial structure has been previously investigated (e.g., Kupermintz & Snow, 1997; Nussbaum, Hamilton, & Snow, 1997). Table 1 shows examples of the content of the science test questions. For the science subject, there are 25 items, and four factors were found in the data collected in 1988: "Elementary science (ES)", "Chemistry knowledge (CK)", "Scientific reasoning (SR)", and "Reasoning with knowledge (RK)". For the math subject, there are 40 items in 1988, and two factors emerged: "Mathematical reasoning (MR)" and "Mathematical knowledge (MK)". We pooled together the data from both domains, resulting in 65 items and a complete sample size of $N = 13{,}488$.
In a previous analysis of NELS:88 by Cho et al. (2020), the GVEM approach was used to empirically estimate the optimal number of latent traits in this data set; the results suggest that NELS:88 measures six latent traits, consistent with the previous literature (e.g., Kupermintz & Snow, 1997; Nussbaum et al., 1997). Thus, we fixed the number of latent factors at six for this analysis. Kupermintz and Snow (1997) and Nussbaum et al. (1997) also identified the latent traits required by each test item based on the content of the questions. Based on their findings, we chose six items, each associated with exactly one of the latent factors, and performed our proposed regularized estimation under constraint 1.
Item | 8th grade | 10th grade | Description
---|---|---|---
S01 | 1 | | Infer geologic history from facts about limestone deposits
S02 | 2 | | Identify components of solar system
S03 | 3 | 2 | Read a graph depicting solubility of chemicals
S04 | 4 | 3 | Choose an improvement for an experiment on mice
S05 | 5 | 4 | Choose a statement about source of moon’s light
S06 | 6 | 5 | Identify the example of a simple reflex
S07 | 7 | | Choose viable way of communicating on moon
S08 | 8 | | Select statement about position of sun, moon, earth in diagram
S09 | 9 | | Identify source of oxygen in ocean water
S10 | 10 | 1 | Choose the property used to classify a list of substances
S11 | 11 | | Explain lower freezing temperature of ocean water
S12 | 12 | 6 | Answer question about the earth’s orbit
S13 | 13 | | Infer use of oxygen from description of condition of aquarium
S14 | 14 | 7 | Estimate temperature of a mixture
S15 | 15 | 8 | Select a statement about the process of respiration
S16 | 16 | 9 | Read a graph depicting digestion of a protein by an enzyme
S17 | 17 | 10 | Explain location of marine algae
S18 | 18 | 11 | Choose best indication of an approaching storm
S19 | 19 | 12 | Choose the alternative that is not a chemical change
S20 | 20 | 13 | Infer statement from results of an experiment using a filter
S21 | 21 | 14 | Explain reason for late afternoon breeze from the ocean
S22 | 22 | 15 | Select basis for a statement about a food chain
S23 | 23 | 16 | Interpret symbols describing a chemical reaction
S24 | 24 | 17 | Differentiate statements based on a model or an observation
S25 | 25 | 18 | Describe color of offspring from a guinea-pig cross
S26 | | 19 | Calculate a mass given density and dimensions
S27 | | 20 | Locate the balance point of a weighted lever
S28 | | 21 | Interpret a contour map
S29 | | 22 | Identify diagram depicting path of light through camera lens
S30 | | 23 | Calculate grams of a substance given its half life
S31 | | 24 | Read population graph; identify equilibrium point
S32 | | 25 | Identify cause of fire from overloaded circuit
S stands for Science items, and item descriptions were adopted from.
Both the EFA rotation method (with the CF-Quartimax rotation) and the adaptive Lasso method were fitted to the data set under M2PL and M3PL. EFA rotation assumed all items load on all factors. For the adaptive Lasso method, we likewise assumed all items load on all factors, and a penalty was therefore added on every element of the loading matrix except the constrained entries. Note that we also considered another version of adaptive Lasso that assumes math items load only on the two math factors and science items load only on the four science factors, so that the loading matrix contains structural 0's that we neither estimate nor penalize; the results of this version are reported in the online supplementary materials to save space. Given the large sample size and test length (65 items in total), a stochastic version of the GVEM algorithm was used for M3PL: we used a stochastic subsample of 200 at each iteration, with an initial subsample of 3000 for more stable convergence. For the models with penalty, only adaptive Lasso was considered, as it was shown to perform better than the Lasso penalty under the majority of conditions in the simulation studies. The adaptive Lasso parameter $\gamma$ was fixed at 3, because the item factor loading structure here is more complex than in the simulation study, and heavier penalization (i.e., higher $\gamma$) was needed to produce a suitably sparse structure. Table 2 shows that M2PL in general yields smaller GIC than M3PL, and that under M2PL, adaptive Lasso produces a smaller GIC than the EFA rotation method. The fact that M2PL is preferred over M3PL implies that guessing may not play a big role in performance on the NELS:88 math and science assessments. Moreover, the larger GIC from the EFA rotation method implies that the factor loading structure obtained from it may not reflect the true item factor relationship as closely as that from the adaptive Lasso method.
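A schematic of the stochastic GVEM subsampling loop mentioned above is sketched below; `em_step` is a hypothetical function performing a single variational E-step and M-step on the sampled rows, and the iteration count is illustrative.

```python
import numpy as np

def stochastic_gvem(Y, em_step, n_init=3000, n_batch=200, n_iter=500, seed=0):
    # One pass on a larger initial subsample for stable starting values,
    # then GVEM iterations on fresh random mini-batches of responses.
    rng = np.random.default_rng(seed)
    init_rows = rng.choice(len(Y), size=n_init, replace=False)
    params = em_step(Y[init_rows], None)
    for _ in range(n_iter):
        batch = rng.choice(len(Y), size=n_batch, replace=False)
        params = em_step(Y[batch], params)
    return params
```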
Model | EFA | AL
---|---|---
M2PL | $1.20\times 10^6$ | $0.73\times 10^6$
M3PL | $1.28\times 10^6$ | $1.81\times 10^6$
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Next, using M2PL with adaptive Lasso, Tables 3 and 4 present the estimated sparse test structure for the math and science tests, respectively, and Tables 5 and 6 present the results from the EFA rotation method. Note that the order of the latent traits for the EFA rotation method is arbitrary. As shown, for the EFA rotation method, before applying the 0.3 cutoff we observe more cross-loadings, but after standardizing the factor loadings and applying the 0.3 cutoff, the EFA rotation method yields a sparser, closer-to-simple structure than the adaptive Lasso method. However, in the EFA rotation method, both math and science items load dominantly on a single factor, which contradicts the findings that two math factors and four science factors best reflect the underlying factor structure. The item factor loadings obtained from the adaptive Lasso method, on the other hand, although less sparse and containing more cross-loadings, appear to be more reasonable.
Tables 7 and 8 present the estimated factor correlations for the two methods. The factor correlations obtained from EFA are in the range of 0.01 to 0.73, much lower than those from the regularization method; adaptive Lasso estimated the correlations between latent factors in the range of 0.81 to 0.99. Such a discrepancy can be explained by the simulation findings: the EFA rotation method appears to underestimate the factor correlations, especially when the true correlation is high and the number of factors is large. Also, GIC favors the regularization method, which implies that high factor correlations are likely present in the data. The observed high correlations also appear consistent with decades of research on NELS data that treats math and science as unitary constructs.
Item | MR | MK | ES | SR | CK | RK
---|---|---|---|---|---|---|
M1 | 0 | 0.768 | 0.923 | 0 | 0 | 0 |
M2 | 0 | 0.645 | 0.500 | 0 | 0 | 0 |
M3 | 0.899 | 0 | 0 | 0 | 0.940 | 0 |
M4 | 0.470 | 1.009 | 0 | 0 | 0 | 0 |
M5 | 0 | 1.484 | 0 | 0 | 0 | 0 |
M6 | 1.149 | 0 | 0 | 0 | 0.812 | 0 |
M7 | 1.016 | 0 | 0 | 0 | 0.263 | 0 |
M8 | 0 | 1.009 | 0 | 0 | 0.041 | 0 |
M9 | 1.373 | 0 | 0 | 0 | 1.255 | 0 |
M10 | 0 | 0 | 0 | 0 | 6.625 | 0 |
M11 | 1.182 | 0 | 0 | 0 | 1.361 | 0 |
M12 | 0 | 0 | 0 | 0 | 6.259 | 0 |
M13 | 1.466 | 0 | 0 | 0 | 0 | 0 |
M14 | 0 | 1.154 | 0 | 0 | 0 | 0 |
M15 | 3.535 | 0 | 1.345 | 0 | 0 | 0 |
M16 | 1.573 | 0 | 0 | 0 | 0 | 0 |
M17 | 0 | 0 | 0 | 0 | 5.784 | 0 |
M18 | 0.867 | 0 | 0 | 0 | 0 | 0 |
M19 | 1.459 | 0 | 0 | 0 | 0 | 0 |
M20 | 1.021 | 0 | 0 | 0 | 0 | 0 |
M21 | 1.521 | 0 | 0 | 0 | 0 | 0 |
M22 | 1.709 | 0 | 0.821 | 0 | 0 | 0 |
M23 | 0.449 | 0 | 0.496 | 0 | 0 | 0 |
M24 | 0.412 | 0 | 0.466 | 0 | 0 | 0 |
M25 | 1.456 | 0 | 0 | 0 | 0 | 0 |
M26 | 0.954 | 0 | 0.390 | 0 | 0 | 0 |
M27 | 0 | 0.463 | 0 | 0 | 0 | 0 |
M28 | 0.758 | 0 | 0.328 | 0 | 0 | 0 |
M29 | 0 | 0 | 2.903 | 0 | 0 | 0 |
M30 | 1.299 | 0 | 0 | 0 | 1.689 | 0 |
M31 | 0 | 0.855 | 0.715 | 0 | 0 | 0 |
M32 | 1.222 | 0 | 0 | 0 | 1.186 | 0 |
M33 | 0.423 | 0 | 0 | 0.099 | 0 | 0 |
M34 | 1.046 | 0 | 0 | 0 | 0.292 | 0 |
M35 | 0 | 0.706 | 0.147 | 0 | 0 | 0 |
M36 | 1.120 | 0 | 0 | 0 | 1.887 | 0 |
M37 | 0.727 | 0 | 1.049 | 0 | 0 | 0 |
M38 | 0 | 1.765 | 0 | 0 | 0 | 0 |
M39 | 0 | 0.475 | 1.460 | 0 | 0 | 0 |
M40 | 1.447 | 0 | 0 | 0 | 0.640 | 0 |
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Item | MR | MK | ES | SR | CK | RK
---|---|---|---|---|---|---|
S1 | 0 | 0 | 0 | 0 | 0.992 | 0 |
S2 | 0 | 0 | 0.822 | 0 | 0 | 0 |
S3 | 0.175 | 0 | 0.629 | 0 | 0 | 0 |
S4 | 0 | 0 | 0.695 | 0 | 0 | 0 |
S5 | 0 | 0 | 1.483 | 0 | 0 | 0 |
S6 | 0 | 0 | 1.338 | 0 | 0 | 0 |
S7 | 0 | 0 | 0 | 0.966 | 0 | 0 |
S8 | 0 | 0 | 0 | 0.741 | 0 | 0 |
S9 | 0 | 0 | 2.469 | 0 | 0 | 0 |
S10 | 0.238 | 0 | 0 | 0 | 0.715 | 0 |
S11 | 0 | 0 | 0 | 0.542 | 0 | 0 |
S12 | 0 | 0.441 | 0 | 0.787 | 0 | 0 |
S13 | 0 | 0 | 0.900 | 0 | 0 | 0 |
S14 | 0 | 0.988 | 0 | 0 | 0.614 | 0 |
S15 | 0 | 0 | 0 | 0 | 0 | 0.65 |
S16 | 0.351 | 0 | 0 | 0 | 0.313 | 0 |
S17 | 0 | 0 | 0.899 | 0 | 0 | 0 |
S18 | 0 | 0 | 0 | 0.056 | 1.093 | 0 |
S19 | 0 | 0 | 0 | 0 | 1.491 | 0 |
S20 | 0.164 | 0 | 0 | 0 | 0.368 | 0 |
S21 | 0 | 0 | 0 | 0 | 0.535 | 0 |
S22 | 0 | 0 | 0.563 | 0 | 0 | 0 |
S23 | 0 | 0 | 0 | 0 | 1.620 | 0 |
S24 | 0 | 0 | 0 | 0 | 0.960 | 0 |
S25 | 0.154 | 0 | 0 | 0 | 0.404 | 0 |
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Estimated item discrimination parameters | Estimated standardized factor loadings | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Item | F1 | F2 | F3 | F4 | F5 | F6 | F1 | F2 | F3 | F4 | F5 | F6
M1 | 0.700 | 0.288 | 0.167 | 0.053 | 0.097 | 0.106 | 0.536 | 0 | 0 | 0 | 0 | 0 |
M2 | 0.671 | 0.252 | 0.091 | 0.187 | 0.087 | 0.147 | 0.528 | 0 | 0 | 0 | 0 | 0 |
M3 | 1.001 | 0.134 | 0.162 | 0 | 0.017 | 0.146 | 0.685 | 0 | 0 | 0 | 0 | 0 |
M4 | 1.454 | 0.135 | 0 | 0.168 | 0 | 0 | 0.816 | 0 | 0 | 0 | 0 | 0 |
M5 | 1.238 | 0.135 | 0.062 | 0.269 | 0.071 | 0.035 | 0.767 | 0 | 0 | 0 | 0 | 0 |
M6 | 1.162 | 0 | 0.114 | 0 | 0.015 | 0.044 | 0.751 | 0 | 0 | 0 | 0 | 0 |
M7 | 0.954 | 0.074 | 0.169 | 0.017 | 0.127 | 0.051 | 0.672 | 0 | 0 | 0 | 0 | 0 |
M8 | 0.783 | 0.163 | 0.084 | 0.07 | 0.088 | 0.036 | 0.603 | 0 | 0 | 0 | 0 | 0 |
M9 | 1.315 | 0 | 0.074 | 0 | 0.167 | 0.056 | 0.781 | 0 | 0 | 0 | 0 | 0 |
M10 | 0.392 | 0.127 | 0.919 | 0 | 0.174 | 0.137 | 0 | 0 | 0.679 | 0 | 0 | 0 |
M11 | 1.087 | 0 | 0.023 | 0 | 0.18 | 0 | 0.711 | 0 | 0 | 0 | 0 | 0 |
M12 | 0.610 | 0.053 | 0.917 | 0 | 0.163 | 0.111 | 0.301 | 0 | 0.693 | 0 | 0 | 0 |
M13 | 1.362 | 0 | 0.094 | 0 | 0 | 0.042 | 0.793 | 0 | 0 | 0 | 0 | 0 |
M14 | 1.107 | 0.024 | 0.082 | 0.221 | 0 | 0.022 | 0.734 | 0 | 0 | 0 | 0 | 0 |
M15 | 0.340 | 0.166 | 0 | 0 | 0.164 | 0.116 | 0 | 0 | 0 | 0 | 0 | 0 |
M16 | 1.733 | 0 | 0 | 0 | 0 | 0.002 | 0.860 | 0 | 0 | 0 | 0 | 0 |
M17 | 0.225 | 0.124 | 0.832 | 0.004 | 0.215 | 0.169 | 0 | 0 | 0.628 | 0 | 0 | 0 |
M18 | 0.710 | 0.019 | 0.01 | 0 | 0.220 | 0 | 0.562 | 0 | 0 | 0 | 0 | 0 |
M19 | 1.421 | 0 | 0 | 0.006 | 0 | 0 | 0.800 | 0 | 0 | 0 | 0 | 0 |
M20 | 1.071 | 0.017 | 0 | 0 | 0.039 | 0 | 0.728 | 0 | 0 | 0 | 0 | 0 |
M21 | 1.588 | 0 | 0 | 0 | 0.022 | 0.063 | 0.837 | 0 | 0 | 0 | 0 | 0 |
M22 | 0.847 | 0.257 | 0 | 0 | 0 | 0.206 | 0.602 | 0 | 0 | 0 | 0 | 0 |
M23 | 0.600 | 0.246 | 0.003 | 0 | 0 | 0.221 | 0.485 | 0 | 0 | 0 | 0 | 0 |
M24 | 0.513 | 0.307 | 0.043 | 0 | 0.018 | 0.112 | 0.431 | 0 | 0 | 0 | 0 | 0 |
M25 | 1.315 | 0.003 | 0.111 | 0 | 0.091 | 0 | 0.788 | 0 | 0 | 0 | 0 | 0 |
M26 | 0.999 | 0.201 | 0.082 | 0 | 0 | 0.208 | 0.678 | 0 | 0 | 0 | 0 | 0 |
M27 | 0.454 | 0.106 | 0.037 | 0.162 | 0 | 0 | 0.407 | 0 | 0 | 0 | 0 | 0 |
M28 | 0.882 | 0.147 | 0.029 | 0.030 | 0 | 0.103 | 0.651 | 0 | 0 | 0 | 0 | 0 |
M29 | 0.577 | 0.378 | 0.155 | 0 | 0 | 0.148 | 0.448 | 0.348 | 0 | 0 | 0 | 0 |
M30 | 1.460 | 0 | 0.214 | 0 | 0.115 | 0 | 0.802 | 0 | 0 | 0 | 0 | 0 |
M31 | 0.779 | 0.296 | 0.172 | 0.175 | 0.239 | 0.289 | 0.536 | 0 | 0 | 0 | 0 | 0 |
M32 | 1.244 | 0 | 0.173 | 0 | 0.096 | 0 | 0.759 | 0 | 0 | 0 | 0 | 0 |
M33 | 0.601 | 0.059 | 0 | 0.006 | 0.241 | 0 | 0.470 | 0 | 0 | 0 | 0 | 0 |
M34 | 1.053 | 0.050 | 0.117 | 0.093 | 0.207 | 0 | 0.704 | 0 | 0 | 0 | 0 | 0 |
M35 | 0.734 | 0.290 | 0 | 0.121 | 0 | 0 | 0.566 | 0 | 0 | 0 | 0 | 0 |
M36 | 1.355 | 0.079 | 0.316 | 0 | 0.062 | 0.229 | 0.751 | 0 | 0.304 | 0 | 0 | 0 |
M37 | 0.923 | 0.344 | 0.156 | 0.040 | 0.087 | 0.453 | 0.574 | 0 | 0 | 0 | 0 | 0.422 |
M38 | 1.542 | 0.132 | 0 | 0.205 | 0.011 | 0.142 | 0.827 | 0 | 0 | 0 | 0 | 0 |
M39 | 0.532 | 0.132 | 0.056 | 0.213 | 0 | 0.202 | 0.438 | 0 | 0 | 0 | 0 | 0 |
M40 | 1.539 | 0.147 | 0.182 | 0.085 | 0.161 | 0.186 | 0.803 | 0 | 0 | 0 | 0 | 0 |
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Estimated item discrimination parameters (left six columns of estimates) and estimated standardized factor loadings (right six columns)
Item | F1 | F2 | F3 | F4 | F5 | F6 | F1 | F2 | F3 | F4 | F5 | F6
---|---|---|---|---|---|---|---|---|---|---|---|---
S1 | 0.135 | 0.145 | 0.116 | 0.101 | 0.370 | 0.426 | 0 | 0 | 0 | 0 | 0.321 | 0.393 |
S2 | 0.092 | 0.145 | 0.096 | 0.093 | 0.235 | 0.399 | 0 | 0 | 0 | 0 | 0 | 0.371 |
S3 | 0.276 | 0.063 | 0.022 | 0 | 0.114 | 0.456 | 0 | 0 | 0 | 0 | 0 | 0.416 |
S4 | 0 | 0.245 | 0.103 | 0 | 0.164 | 0.322 | 0 | 0 | 0 | 0 | 0 | 0.307 |
S5 | 0.284 | 0.307 | 0.109 | 0.107 | 0.303 | 0.727 | 0 | 0 | 0 | 0 | 0 | 0.598 |
S6 | 0.280 | 0.271 | 0.153 | 0.060 | 0.287 | 0.613 | 0 | 0 | 0 | 0 | 0 | 0.529 |
S7 | 0 | 0.138 | 0 | 0.117 | 0.254 | 0.602 | 0 | 0 | 0 | 0 | 0 | 0.519 |
S8 | 0.014 | 0.144 | 0 | 0.073 | 0.336 | 0.388 | 0 | 0 | 0 | 0 | 0 | 0.363 |
S9 | 0.136 | 0.193 | 0 | 0.017 | 0.006 | 0.573 | 0 | 0 | 0 | 0 | 0 | 0.499 |
S10 | 0.310 | 0.132 | 0.094 | 0.043 | 0.377 | 0.215 | 0 | 0 | 0 | 0 | 0.347 | 0 |
S11 | 0.056 | 0.04 | 0 | 0.058 | 0.309 | 0.237 | 0 | 0 | 0 | 0 | 0 | 0 |
S12 | 0.426 | 0.168 | 0.027 | 0.198 | 0.301 | 0.420 | 0.325 | 0 | 0 | 0 | 0 | 0.391 |
S13 | 0.055 | 0.23 | 0.215 | 0.06 | 0.067 | 0.515 | 0 | 0 | 0 | 0 | 0 | 0.459 |
S14 | 1.007 | 0.159 | 0.091 | 0.303 | 0.307 | 0.170 | 0.656 | 0 | 0 | 0 | 0 | 0 |
S15 | 0.124 | 0 | 0.087 | 0.019 | 0 | 0.645 | 0 | 0 | 0 | 0 | 0 | 0.543 |
S16 | 0.273 | 0.248 | 0.040 | 0 | 0.332 | 0 | 0 | 0 | 0 | 0 | 0.315 | 0 |
S17 | 0.219 | 0 | 0 | 0.104 | 0.494 | 0.279 | 0 | 0 | 0 | 0 | 0.430 | 0 |
S18 | 0.278 | 0 | 0 | 0.193 | 0.507 | 0.311 | 0 | 0 | 0 | 0 | 0.436 | 0 |
S19 | 0.206 | 0.002 | 0.059 | 0.095 | 0.422 | 0.291 | 0 | 0 | 0 | 0 | 0.375 | 0 |
S20 | 0.196 | 0 | 0 | 0 | 0.392 | 0.114 | 0 | 0 | 0 | 0 | 0.363 | 0 |
S21 | 0.072 | 0.170 | 0 | 0.01 | 0.538 | 0 | 0 | 0 | 0 | 0 | 0.474 | 0 |
S22 | 0.045 | 0.099 | 0 | 0 | 0.158 | 0.358 | 0 | 0 | 0 | 0 | 0 | 0.337 |
S23 | 0 | 0.278 | 0.094 | 0 | 0.399 | 0 | 0 | 0 | 0 | 0 | 0.362 | 0 |
S24 | 0.160 | 0.061 | 0.101 | 0.082 | 0.290 | 0.523 | 0 | 0 | 0 | 0 | 0 | 0.465 |
S25 | 0.167 | 0.114 | 0.073 | 0 | 0.126 | 0.199 | 0 | 0 | 0 | 0 | 0 | 0 |
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Factor | MR | MK | ES | SR | CK | RK
---|---|---|---|---|---|---
MR | 1.0000 | 0.9808 | 0.8465 | 0.7740 | 0.8646 | 0.8328 |
MK | 0.9808 | 1.0000 | 0.9242 | 0.8684 | 0.9364 | 0.9119 |
ES | 0.8465 | 0.9242 | 1.0000 | 0.9901 | 0.9968 | 0.9931 |
SR | 0.7740 | 0.8684 | 0.9901 | 1.0000 | 0.9822 | 0.9807 |
CK | 0.8646 | 0.9364 | 0.9968 | 0.9822 | 1.0000 | 0.9873 |
RK | 0.8328 | 0.9119 | 0.9931 | 0.9807 | 0.9873 | 1.0000 |
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Factor | F1 | F2 | F3 | F4 | F5 | F6
---|---|---|---|---|---|---
F1 | 1.0000 | 0.5784 | 0.7344 | 0.0848 | 0.6897 | 0.6780 |
F2 | 0.5784 | 1.0000 | 0.3224 | 0.3783 | 0.4007 | 0.7268 |
F3 | 0.7344 | 0.3224 | 1.0000 | 0.0067 | 0.4671 | 0.4186 |
F4 | 0.0848 | 0.3783 | 0.0067 | 1.0000 | 0.1829 | 0.2630 |
F5 | 0.6897 | 0.4007 | 0.4671 | 0.1829 | 1.0000 | 0.5051 |
F6 | 0.6780 | 0.7268 | 0.4186 | 0.2630 | 0.5051 | 1.0000 |
SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS: 88), “Base Year Through Second Follow-up”.
Discussion
Exploratory factor analysis (EFA) is a popular statistical tool for gaining insight into the latent structures underlying observed data (Gorsuch, Reference Gorsuch1988; Fabrigar & Wegener, Reference Fabrigar and Wegener2011). Exploratory item factor analysis is the subset of EFA methods that deals with categorical observed data. In exploratory IFA, the relationships among observed item responses are explained by a small number of common factors. The names of the common factors can be inferred from the content of the items that load on them, and hence a simple structure, with each item loading on a single factor, is usually preferred.
In this paper, a Gaussian variational regularization method is proposed for estimating the sparse item-trait relationship in M2PL and M3PL models. This computationally efficient method estimates the item factor loading structure and the model parameters simultaneously. Both Lasso and adaptive Lasso penalties are considered, and simulation studies demonstrate that both recover the sparse item-trait structure well for the M2PL and M3PL models; of the two, the adaptive Lasso penalty is preferred. With the adaptive Lasso penalty, GIC is used to choose the tuning parameter $\lambda $, whereas the tuning parameter $\gamma $ takes one of the three values suggested by Zou (Reference Zou2006). The adaptive Lasso also outperforms the traditional EFA rotation method in most simulation conditions, and the two methods are nearly indistinguishable in simpler scenarios, such as lower factor correlations and lower dimensions. Since a user-specified cutoff is needed to decide which factor loadings are "significant," future studies could consider sparsity-encouraging rotations (e.g., Jennrich, Reference Jennrich2006) to avoid arbitrarily truncating the rotated factor loadings.
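To make the tuning step concrete, the following minimal Python sketch illustrates GIC-based selection of $\lambda $ with adaptive Lasso weights. The complexity weight $c_N = \log(\log N)\log N$ and the fitting wrapper `fit_fn` are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

def gic(loglik: float, df: int, N: int) -> float:
    """Generalized information criterion.  The complexity weight
    c_N = log(log(N)) * log(N) is one common choice (an assumption
    here; the paper's exact c_N may differ)."""
    c_N = np.log(np.log(N)) * np.log(N)
    return -2.0 * loglik + c_N * df

def adaptive_weights(alpha_init: np.ndarray, gamma: float = 1.0,
                     eps: float = 1e-8) -> np.ndarray:
    """Adaptive Lasso weights w_jk = 1 / |alpha_init_jk|^gamma built from
    an unpenalized initial estimate; Zou (2006) suggests gamma in {0.5, 1, 2}."""
    return 1.0 / (np.abs(alpha_init) + eps) ** gamma

def select_lambda(fit_fn, lambdas, N):
    """Pick the penalty level minimizing GIC.  `fit_fn` is a hypothetical
    wrapper that runs the penalized GVEM fit at a given lambda and
    returns (loglik, alpha_hat)."""
    best_lam, best_score, best_alpha = None, np.inf, None
    for lam in lambdas:
        loglik, alpha_hat = fit_fn(lam)
        df = int(np.count_nonzero(alpha_hat))  # nonzero loadings as model df
        score = gic(loglik, df, N)
        if score < best_score:
            best_lam, best_score, best_alpha = lam, score, alpha_hat
    return best_lam, best_alpha
```

In practice, one would run `select_lambda` once for each candidate $\gamma $ and keep the combination with the smallest GIC.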
The current study can be expanded in several directions. First, we assume that the total number of factors, K, is known in advance. This assumption may be relaxed by varying K and using GIC as the model selection criterion to choose the optimal K; a sketch of this approach follows this paragraph. This approach contrasts with the family of criteria based on the eigenvalues of the sample tetrachoric or polychoric correlation matrix of the observed data, such as the scree test (Cattell, Reference Cattell1966) and parallel analysis (Horn, Reference Horn1965). Future studies evaluating the relative performance of these two approaches are worth pursuing. Note that because K is usually defined as the minimum number of latent common factors needed to describe the statistical dependencies in the data, challenges may arise when there are additional nuisance factors, as in a bi-factor structure.
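A minimal sketch of GIC-based selection of K, under the same assumptions as the previous sketch; `fit_fn` is again a hypothetical wrapper around the penalized GVEM fit.

```python
import numpy as np

def gic(loglik: float, df: int, N: int) -> float:
    # Same hedged GIC form as in the previous sketch.
    return -2.0 * loglik + np.log(np.log(N)) * np.log(N) * df

def choose_K(fit_fn, K_grid, N):
    """Refit at each candidate dimension and keep the K minimizing GIC.
    `fit_fn` is a hypothetical wrapper that fits the penalized GVEM model
    with K factors and returns (loglik, n_nonzero_params)."""
    scores = {}
    for K in K_grid:
        loglik, df = fit_fn(K)
        scores[K] = gic(loglik, df, N)
    best_K = min(scores, key=scores.get)
    return best_K, scores
```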
Second, the proposed method may be generalized to other types of MIRT models, such as non-compensatory models (e.g., C. Wang & Nydick, Reference Wang and Nydick2015), which are essentially nonlinear item factor models. While the adaptive Lasso idea applies directly, more work is needed to derive a suitable variational lower bound to enable the GVEM algorithm.
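For readers unfamiliar with this model class, one common non-compensatory (partially compensatory) item response function multiplies per-dimension logistic terms, as in the following sketch; the exact parameterization in C. Wang & Nydick (Reference Wang and Nydick2015) may differ.

```python
import numpy as np

def noncompensatory_irf(theta: np.ndarray, a_j: np.ndarray,
                        b_j: np.ndarray, c_j: float = 0.0) -> float:
    """One common non-compensatory IRF: the success probability is a
    product of per-dimension logistic terms, so a deficit on one
    dimension cannot be offset by a surplus on another.  theta, a_j,
    and b_j are length-K arrays; c_j is an optional guessing floor,
    as in the M3PL model."""
    terms = 1.0 / (1.0 + np.exp(-a_j * (theta - b_j)))
    return float(c_j + (1.0 - c_j) * np.prod(terms))
```

Because the product over dimensions makes the log-likelihood nonlinear in $\theta $, the Gaussian variational bounds used for the (compensatory) M2PL and M3PL models do not carry over directly.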
Third, it is of interest to further study the theoretical properties of the estimation and the model selection consistency of the proposed method. As shown in Cho et al. (Reference Cho, Wang, Zhang and Xu2021), the GVEM algorithm (without an additional penalty) consistently estimates the model parameters of the 2-parameter MIRT model under a global Frobenius norm evaluation in the asymptotic regime where both N and J increase to infinity. With the additional adaptive Lasso penalty, a similar global consistency result is expected to hold when the tuning parameter is properly chosen. For instance, with the tuning parameter $\lambda =0$, the proposed estimator reduces to that of Cho et al. (Reference Cho, Wang, Zhang and Xu2021), and the consistency result then follows. Moreover, it would also be of interest to study variable selection consistency and the oracle properties, as in Zou (Reference Zou2006), under the MIRT setting. This problem, however, is much more challenging, for several reasons. First, additional work is needed to derive entry-wise consistency and convergence rate results under the double asymptotic regime with $N, J\rightarrow \infty $. In particular, establishing the oracle properties requires a sharp characterization of the entry-wise convergence rate of the GVEM estimators, which is a challenging problem in the high-dimensional MIRT model. Second, the theoretical analysis of the adaptive Lasso (or other penalties) is harder under high-dimensional latent variable models such as MIRT, and the variational approximation further complicates the problem; indeed, the frequentist consistency properties of many variational approximation methods remain unaddressed in the current literature. For these reasons, we leave this interesting problem for future study.
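For concreteness, the penalized objective under discussion can be sketched as follows; the notation $\ell_{\mathrm{VLB}}$ for the Gaussian variational lower bound and the form of the penalty weights are assumptions based on Zou (Reference Zou2006), not a verbatim reproduction of the paper's equations.

```latex
% Sketch of the adaptive-Lasso-penalized GVEM objective (assumed notation):
% \ell_{VLB} is the Gaussian variational lower bound.
\hat{\alpha}
  = \arg\max_{\alpha}
    \Bigl\{ \ell_{\mathrm{VLB}}(\alpha, b)
      - \lambda \sum_{j=1}^{J} \sum_{k=1}^{K}
        \hat{w}_{jk}\, \lvert \alpha_{jk} \rvert \Bigr\},
\qquad
\hat{w}_{jk} = \lvert \tilde{\alpha}_{jk} \rvert^{-\gamma},
```

where $\tilde{\alpha}$ is an unpenalized initial GVEM estimate; setting $\lambda =0$ removes the penalty and recovers the estimator of Cho et al. (Reference Cho, Wang, Zhang and Xu2021).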
Finally, it would be of interest to obtain standard errors of the proposed regularized estimators. For variational approximations, the de-biasing techniques commonly used in high-dimensional statistics may not be directly applicable because of the additional approximation bias induced by the variational method. One way to reduce this variational bias is to perform importance-sampling-based reweighting after the variational estimation so that the likelihood function is better approximated (Domke & Sheldon, Reference Domke and Sheldon2018); a de-biasing step for the regularized estimation could then be used to obtain the standard error estimates. Another approach is to use the bootstrap to obtain the standard errors. Because the setting of variational estimation for MIRT differs from much of the existing work on de-biased estimation and the bootstrap, the theoretical consistency properties of these methods are challenging to establish and remain open problems in the literature. We therefore leave this interesting problem for future study.
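As a rough illustration of the bootstrap route, the sketch below resamples respondents and refits; `fit_fn` is a hypothetical wrapper around the regularized GVEM fit, and the alignment of replicates is deliberately omitted.

```python
import numpy as np

def bootstrap_se(Y: np.ndarray, fit_fn, B: int = 200, seed: int = 0):
    """Nonparametric bootstrap over respondents (rows of the N x J binary
    response matrix Y).  `fit_fn` is a hypothetical wrapper that reruns
    the regularized GVEM fit on a resampled data set and returns the
    estimated loading matrix.  Sign and column alignment of the
    replicates (to handle rotational indeterminacy and label switching)
    is omitted here but needed in practice."""
    rng = np.random.default_rng(seed)
    N = Y.shape[0]
    replicates = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)   # resample rows with replacement
        replicates.append(fit_fn(Y[idx]))
    # Entry-wise standard deviation across bootstrap replicates.
    return np.stack(replicates).std(axis=0, ddof=1)
```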