Impact Statement
Singe particle reconstruction in cryo-electron microscopy (cryo-EM) is an increasingly popular technique for near-atomic imaging of biological macromolecules. Both technological and computational advances have driven the progress of the techniques, yet many computational obstacles still remain. We introduce a fast method to estimate the covariance matrix of noisy cryo-EM images, which is a central component of many computational cryo-EM techniques. As an application, we use the covariance matrix for image denoising and deconvolution of the microscope’s contrast transfer function. Our method provides orders of magnitude speedup compared to previous approaches, which opens the door to tackling more challenging datasets.
1. Introduction
We study the problem of computing a compressed representation of the covariance matrix of 2-D cryo-electron microscopy (cryo-EM) images for the purpose of performing principal component analysis (PCA). More precisely, we consider an image formation model where the measurement $ {g}_i $ is defined by
where $ {h}_i $ is a radial function, $ \ast $ denotes convolution, $ {f}_i $ is a ground-truth image, $ {\varepsilon}_i $ is the noise term, and $ N $ the total number of images. We emphasize that we assume $ {h}_i $ is radial, see Assumption A1.
We are motivated by single particle cryo-EM imaging, which is an important technique for determining the 3-D structure of macromolecules. In particular, the single particle reconstruction (SPR) problem asks to recover the 3-D structure of a macromolecule from noisy 2-D images of its tomographic projections along unknown viewing angles. In cryo-EM, the mathematical model is a special case of (1) and is of the form
where $ x=\left({x}^{\prime },{x}_3\right)\in {\mathrm{\mathbb{R}}}^2\times \mathrm{\mathbb{R}}\cong {\mathrm{\mathbb{R}}}^3 $ are 3-D spatial coordinates with $ {x}_3 $ representing the projection direction, $ {h}_i $ is the point spread function, $ {\varphi}_i:{\mathrm{\mathbb{R}}}^3\to \mathrm{\mathbb{R}} $ is the electrostatic potential of a molecule, $ {R}_i\in SO(3) $ is a 3-D rotation, and $ {\varepsilon}_i $ the noise term. In computational microscopy, it is typical to work with the Fourier transform of the point spread function, which is known as the contrast transfer function (CTF). In the simplest case, each measurement could correspond to a single fixed molecule potential function $ {\varphi}_i=\varphi $ ; however, in general, we may assume that each $ {\varphi}_i $ could be a random variable representing a mixture of molecules, conformational heterogeneity, cases where the images are not perfectly centered, or other measurement imperfections(Reference Liu and Frank1,Reference Penczek, Kimmel and Spahn2) .
In general, each measurement $ {g}_i $ can be associated with a different point spread function; however, in practice, a group of measurements, called a defocus group, can share a common point spread function. We assume that the measurements are grouped into $ M\le N $ defocus groups. Given $ {g}_i $ and $ {h}_i $ for $ i=1,\dots, N $ , our goal is to estimate the 2-D covariance function $ c:{\mathrm{\mathbb{R}}}^2\times {\mathrm{\mathbb{R}}}^2\to \mathrm{\mathbb{R}} $ of the images
where $ f $ is a random variable from the same distribution as the images $ {f}_i $ , and $ \overline{f}\left({x}^{\prime}\right)=\unicode{x1D53C}\left[f\left({x}^{\prime}\right)\right] $ . We assume that the distribution of the images is invariant to in-plane rotations (which is typically the case in cryo-EM).
In cryo-EM, the random variable $ f $ is of the form $ f\left({x}^{\prime}\right)={\int}_{\mathrm{\mathbb{R}}}\Phi \left({R}^{-1}x\right){dx}_3 $ , where the random variable $ R $ is an unknown viewing angle, and the random variable $ \Phi $ is a molecule potential. There is generally no physical reason for a molecule to prefer one in-plane rotation to another so distributions of random variables of this form are generally invariant to in-plane rotations. In the field of cryo-EM processing, the covariance function $ c $ is simply referred to as the 2-D covariance.
1.1. Motivation
The 2D-covariance is an essential component of a number of computational techniques in cryo-EM; we survey a few of these below.
First, we are motivated by PCA, which is a ubiquitous technique in statistics, data science, and computational mathematics and has applications to dimensionality reduction, denoising, visualization, among others. The principal components (i.e., the top eigenvectors of the digitized covariance matrix) have a number of uses in the computational cryo-EM pipeline. The subspace corresponding to the top eigenvectors of the covariance matrix identifies salient features of the dataset which enables, for instance, improved methods for image classification and visualization, such as Multivariate Statistical Analysis(Reference Van Heel3–Reference van Heel, Portugal and Schatz6). These techniques improve computational speed, since clustering becomes computationally easier in a space of reduced dimension, as well as accuracy, since dimensionality reduction by PCA amplifies the effective signal-to-noise ratio (SNR) because many coordinates for which noise dominates the signal are eliminated(Reference Zhao and Singer7).
Second, the covariance matrix has applications in the method of moments, a classical statistical inference method, applied to cryo-EM(Reference Kam8). In this method, the 2-D covariance is used to compute the similarly defined autocorrelation function of the underlying 3-D structure. Under further assumptions such as sufficient nonuniformity of the distribution of the viewing angles(Reference Sharon, Kileel, Khoo, Landa and Singer9) or sufficient sparsity of the molecular structure(Reference Bendory, Khoo, Kileel, Mickelin and Singer10), this autocorrelation function determines the 3-D density map either up to a finite list of possible structures or uniquely, respectively. This has been further developed into principled methods for ab initio estimation of cryo-EM structures(Reference Sharon, Kileel, Khoo, Landa and Singer9,Reference Bendory, Khoo, Kileel, Mickelin and Singer10) , with reduced risk of user-induced model bias in the initial model. Alternatively, when additional information is available, for instance, one(Reference Huang, Zehni, Dokmanić and Zhao11) or two(Reference Levin, Bendory, Boumal, Kileel and Singer12) noiseless projection images, or the 3-D structure of a related, homologous structure(Reference Bhamre, Zhang and Singer13,Reference Bhamre, Zhang and Singer14) , the 3-D density map is uniquely determined by the autocorrelation, without requiring any structural assumptions.
Third, the covariance matrix has applications in denoising and CTF-correcting projection images. Covariance Wiener Filtering (CWF)(Reference Bhamre, Zhang and Singer15) is an approach that uses the classic Wiener filtering framework with the estimated covariance matrix to solve the image deconvolution and denoising problem. The technique represents images in a lower dimensional subspace that is formed from PCA using the estimated covariance matrix. The method then applies Wiener filtering to correct the CTFs and denoise the images in this reduced subspace.
Compared to the standard PCA problem, the cryo-EM setting exhibits further computational challenges, since the estimation method also has to account for convolution with the point spread function, which destroys information of the resulting convolved function; see Section Section 2.2 for a more precise statement. On the other hand, the problem has additional symmetries making fast algorithms possible. In this paper, we present a new fast algorithm for estimating the covariance matrix that improves upon past approaches (especially when there are a large number of defocus groups) in terms of time and space complexity.
1.2. Main contribution
The main contribution of this paper is a new computational method for estimating the covariance Equation (3) from $ N $ measurements of the form Equation (1) encoded by $ L\times L $ digitized images. The presented fast method has time complexity $ O\left({NL}^3+{L}^4\right) $ independent of the number $ M\le N $ of defocus groups. This is in contrast to past methods, where this complexity scales poorly with $ M $ and involves $ O\left({MTL}^4+{NL}^3\right) $ operations(Reference Bhamre, Zhang and Singer15), where $ T $ is the number of iterations needed in a Conjugate Gradient step. Many modern cryo-EM experimental datasets fall into the computationally challenging regime where $ M $ scales with $ N $ .
Our fast method hinges on a new fast and accurate method for expanding $ L\times L $ images into the Fourier–Bessel basis, which provides a convenient way to handle convolution of radial functions (such as point spread functions) with images: namely, convolution with radial functions can be expanded as a diagonal operator operating on the basis coefficients(Reference Marshall, Mickelin and Singer16).
The Fourier–Bessel basis functions are harmonics on the disk: the standing waves associated with the resonant frequencies of a disk-shaped drum with a fixed boundary. More precisely, the harmonics on the disk are eigenfunctions of the Laplacian on the unit disk that satisfy Dirichlet boundary conditions. In computational mathematics, this basis is referred to as the Fourier–Bessel basis, since the basis functions can be expressed as a product of a Bessel function and a complex exponential; see Equation (6) for a definition.
Because of this simple structure, the covariance matrix of clean images can be estimated by a simple closed-form solution, without using the (computationally expensive) conjugate gradient method from previous approaches. Simultaneously, the covariance matrix retains its block diagonal structure, meaning that its diagonal blocks can be estimated separately and independently, which altogether makes PCA fast.
We present numerical results of covariance estimation on synthetic and experimental data. Additionally, we show how the estimated covariance matrix can be used to denoise images using CWF, and perform PCA to visualize eigenimages from experimental data. Code implementing the method is publicly available online.Footnote 1 Moreover, our approach has the potential to generalize to settings beyond cryo-EM, where PCA is used for signals estimated under more general group actions(Reference Bandeira, Blum-Smith, Kileel, Perry, Weed and Wein17).
The remainder of the article is organized as follows. In Section 2, we describe the computational method. In Section 3, we present numerical results for synthetic data. In Section 4 we present numerical results for experimental data. In Section 5, we discuss the results and possible extensions.
2. Methodology
2.1. Notation
For two $ M\times N $ -matrices $ A $ and $ B $ , we denote their Hadamard (or entrywise) product by $ A\odot B $ , the Hadamard division of $ A $ and $ B $ by $ A\oslash B $ and the $ \mathrm{\ell} $ th Hadamard power of $ A $ by $ {A}^{\odot \mathrm{\ell}} $ . These operations are defined elementwise by
respectively. If $ w $ is an $ N $ -dimensional vector, then $ \operatorname{diag}(w) $ denotes the $ N\times N $ matrix with $ w $ along its diagonal, that is, $ \operatorname{diag}{(w)}_{jj}={w}_j $ , and zeros elsewhere. If $ f:{\mathrm{\mathbb{R}}}^2\to \mathrm{\mathbb{R}} $ is a radial function, we write $ f(x)=f\left(|x|\right) $ to mean that $ f $ can be expressed as a function only of the magnitude $ \mid x\mid $ of $ x $ .
For an integrable function $ f:{\mathrm{\mathbb{R}}}^2\to \mathrm{\mathbb{R}} $ , we denote the Fourier transform of $ f $ by $ \hat{f}:{\mathrm{\mathbb{R}}}^2\to \mathrm{\mathbb{C}} $ with the convention that
2.2. Technical details
We make the following assumptions
-
A1. We assume that the point spread functions $ {h}_i $ in the model Equation (1) are radial functions; this implies that their Fourier transform (the CTFs) are also radial. In systems where astigmatism is present and the point spread function deviates slightly from a radial function, our approach can be used as an initial approximation that could be refined using the Conjugate Gradient method.
-
A2. We assume that the underlying images $ {f}_i $ in the model Equation (1) are i.i.d. random variables whose distribution is invariant to in-plane rotations.
-
A3. We assume a technical condition on the Fourier transform of the point spread functions $ {h}_i $ in the model Equation (1). Namely, that the Fourier transforms $ {\hat{h}}_i $ of the $ {h}_i $ satisfy
where $ \delta >0 $ is fixed and $ \left[{\lambda}_{\mathrm{min}},{\lambda}_{\mathrm{max}}\right] $ is the interval of Fourier space used in the disk harmonic expansion, see Ref. (Reference Marshall, Mickelin and Singer16, Section 2.4).
Informally speaking, Assumption A3 can be interpreted as saying that for any pair of frequencies $ \xi $ and $ \eta $ there is a point spread function $ {h}_i $ such that neither $ {\hat{h}}_i\left(\xi \right) $ nor $ {\hat{h}}_i\left(\eta \right) $ are zero. Assumption A1 implies that $ {\hat{h}}_i\left(\xi \right)={\hat{h}}_i\left(|\xi |\right) $ is a radial function. Figure 1 shows the values of $ {\sum}_{i=1}^N{\left|{\hat{h}}_i\left(\xi \right)\right|}^2{\left|{\hat{h}}_i\left(\eta \right)\right|}^2 $ for each pair of radial frequency $ \xi, \eta $ in log scale, for 1081 distinct CTF images of size $ L=360 $ , whose defocus values range from 0.81 to 3.87 m, for an experimental dataset; see Section 4 for more details.
This assumption is much weaker than assuming that for all $ i $ that $ \mid {\hat{h}}_i\mid $ does not vanish at any frequency. In the latter case, we could just use $ {h}_i $ to invert each equation to get access to the underlying functions $ {f}_i $ . If we had direct access to the underlying functions $ {f}_i $ , then we could approximate the covariance function $ c $ by the sample covariance
where $ \tilde{f}\left({x}^{\prime}\right)={\sum}_{i=1}^N\frac{1}{N}{f}_i\left({x}^{\prime}\right) $ is the sample mean. Indeed, $ {c}_N\left({x}^{\prime },{y}^{\prime}\right)\to c\left({x}^{\prime },{y}^{\prime}\right) $ by the law of large numbers. To clarify why it is useful for $ {\hat{h}}_i $ not to vanish, note that in the Fourier domain our measurement model can be expressed as
where $ {\hat{g}}_i,{\hat{h}}_i,{\hat{f}}_i $ , and $ {\hat{\varepsilon}}_i $ denote the Fourier transforms of $ {g}_i,{h}_i,{f}_i, $ and $ {\varepsilon}_i $ , respectively. Since taking the Fourier transform changes convolution to point-wise multiplication, if each Fourier transform $ \mid {\hat{h}}_i\mid \ge \delta $ for some $ \delta >0 $ , then we could estimate $ c $ by first estimating $ {\hat{f}}_i $ from $ {\hat{g}}_i $ for $ i=1,\dots, N $ , and then using the sample covariance matrix. However, in practice, the CTF $ {\hat{h}}_i $ is approximately a radial function with many zero-crossings, which means that multiplication by $ {\hat{h}}_i $ destroys information in the corresponding frequencies, making the restoration from a single image ill-posed.
2.3. Fourier–Bessel
The main ingredient in our fast covariance estimation is a fast transform into a convenient and computationally advantageous basis, known as the Fourier–Bessel basis (which consists of the harmonics on the disk: eigenfunctions of the Laplacian on the disk that obey Dirichlet boundary conditions). This specific choice of basis has a number of beneficial properties:
-
(i) it is orthonormal,
-
(ii) it is ordered by frequency,
-
(iii) it is steerable, that is, images can be rotated by applying a diagonal transform to the basis coefficients,
-
(iv) it is easy to convolve with radial functions, that is, images can be convolved with radial functions by applying a diagonal transform to the basis coefficients.
These properties have made the Fourier–Bessel basis a natural choice in a number of imaging applications(Reference Zhao and Singer7, Reference Bhamre, Zhang and Singer15, Reference Rangan, Spivak, Andén and Barnett20–Reference Zhao, Shkolnisky and Singer22) and will be central to the development of our fast covariance estimation method. In polar coordinates $ \left(r,\theta \right) $ in the unit disk $ \left\{\left(r,\theta \right):r\in \right[0,1\left),\theta \in \right[0,2\pi \left)\right\} $ , the Fourier–Bessel basis functions are defined by
where $ {\gamma}_{nk} $ is a normalization constant, $ {J}_n $ is the $ n $ -th order Bessel function of the first kind (see Ref. Reference Lozier23, Section 10.2), and $ {\lambda}_{nk} $ is its $ k $ th smallest positive root; the indices $ \left(n,k\right) $ run over $ \mathrm{\mathbb{Z}}\times {\mathrm{\mathbb{Z}}}_{>0} $ .
Recent work(Reference Marshall, Mickelin and Singer16) has devised a new fast algorithm to expand $ L\times L $ -images into $ \sim {L}^2 $ Fourier–Bessel basis functions. Informally speaking, given $ \sim {L}^2 $ basis coefficients, the algorithm can evaluate the function on an $ L\times L $ grid in $ O\left({L}^2\log L\right) $ operations; the adjoint can be computed in the same number of operations, which makes iterative methods fast. Compared to previous expansion methods(Reference Zhao and Singer7,Reference Zhao, Shkolnisky and Singer22) , it enjoys both theoretically guaranteed accuracy and lower time complexity.
2.4. Key property of Fourier–Bessel basis
A key property of the Fourier–Bessel basis is that convolution with radial functions is diagonal transformations in any truncated basis expansion. More precisely, the following result holds (Ref. Reference Marshall, Mickelin and Singer16, Lemma 2.3): suppose that $ f={\sum}_{\left(n,k\right)\in I}{\alpha}_{nk}{\psi}_{nk} $ for some index set $ I $ , and $ h=h\left(|x|\right) $ is a radial function. Then,
where $ {P}_{\mathrm{\mathcal{I}}} $ denotes orthogonal projection onto the span of $ {\left\{{\psi}_{nk}\right\}}_{\left(n,k\right)\in I} $ , $ \hat{h} $ is the Fourier transform of $ h $ , and $ {\lambda}_{nk} $ is the $ k $ th positive root of $ {J}_n $ .
We emphasize that the weights of the diagonal transform in Equation (7) are not the coefficients of $ h $ in the disk harmonic expansion. Indeed, since $ h $ is radial, it follows from Equation (6) that the coefficients $ {\beta}_{nk} $ of $ h $ in the basis $ {\psi}_{nk} $ satisfy $ {\beta}_{nk}=0 $ when n ≠ 0. Computing the weights $ \hat{h}\left({\lambda}_{nk}\right) $ from the coefficients $ {\beta}_{nk} $ of $ h $ in the basis, would involve computing weighted sums of the Fourier transforms of the basis functions: $ \hat{h}\left({\lambda}_{nk}\right)={\sum}_{l\in \mathcal{J}}{\beta}_{0l}{\hat{\psi}}_{0l}\left({\lambda}_{nk}\right) $ , for some index set $ \mathcal{J} $ .
As an alternative to the disk harmonic basis expansion, one can consider simply taking the Fourier transform of (1), which also leads to a diagonal representation of the convolution operator. However, the discrete Fourier transform does not have the steerability property, which is essential for the covariance estimation. Another attempt could be to use the polar Fourier transform. However, this representation is not invariant to arbitrary in-plane rotations, but only to finitely many rotations as determined by the discretization spacing of the grid. These expansions are therefore unsuitable for the goal of this article and we instead use expansions into the Fourier–Bessel basis, although other steerable bases could be considered(Reference Landa and Shkolnisky24,Reference Landa and Shkolnisky25) . Table 1 summarizes the considerations that make the Fourier–Bessel basis a natural choice of basis.
a The basis expansion from Cartesian grid representation can be completed within $ O\left({L}^2\right) $ operations up to log factors.
b The new expansion algorithm(Reference Marshall, Mickelin and Singer16) improves the previous computational method(Reference Zhao and Singer7) in terms of accuracy guarantees, computational complexity, and the fact that it derives weights such that radial convolution is a diagonal operation.
2.5. Block diagonal structure
The steerable property of the Fourier–Bessel basis implies that the 2D-covariance will be block diagonal in this basis. A full representation of an $ {L}^2\times {L}^2 $ matrix requires $ O\left({L}^4\right) $ elements, but this is reduced to $ O\left({L}^3\right) $ nonzero entries by the block diagonal structure. This block diagonal structure follows from the form of the basis functions and that the distribution of in-plane rotations is assumed to be uniform. Indeed, suppose that $ f={\sum}_{\left(n,k\right)\in I}{\alpha}_{nk}{\psi}_{nk}. $ For simplicity assume that $ f $ has mean zero; subtracting the mean will only change the radial components, since the other components corresponding to nonvanishing angular frequencies have zero mean by merely averaging over all possible in-plane rotations. By Assumption A2, the covariance function in polar coordinates satisfies
for all $ \varphi $ in $ \left[0,2\pi \right] $ . The covariance function in Equation (3) can be expanded in a double Fourier–Bessel basis expansion as
where $ {C}_{\left( nk,{n}^{\prime }{k}^{\prime}\right)} $ is the covariance matrix in the Fourier–Bessel basis. Combining Equations (8) and (9) and integrating $ \varphi $ over $ \left[0,2\pi \right] $ gives
where $ {\delta}_{n={n}^{\prime }} $ is a Dirac function that is equal to $ 1 $ if $ n={n}^{\prime } $ and zero otherwise, and we note that the second to last equality uses the fact that $ {\psi}_{nk}\left(r,\theta \right)={\gamma}_{nk}{J}_n\left({\lambda}_{nk}r\right){e}^{in\theta} $ . Since the coefficients $ {C}_{\left( nk,{n}^{\prime }{k}^{\prime}\right)} $ in the expansion Equation (9) are unique, it follows from Equation (10) that $ {C}_{\left( nk,{n}^{\prime }{k}^{\prime}\right)}=0 $ when . Hence, the covariance matrix $ {C}_{\left( nk,{n}^{\prime }{k}^{\prime}\right)} $ has a block diagonal structure whose blocks consist of the indices $ \left( nk,n{k}^{\prime}\right) $ for a given value of $ n $ . In the following section, we show how these properties enable a fast method to estimate the covariance matrix.
2.6. Covariance estimation
In the Fourier–Bessel basis, Equation (1) is written as
where $ {G}_i $ , $ {F}_i $ , and $ {E}_i $ are coefficient vectors of $ {g}_i,{f}_i,{\varepsilon}_i $ in the Fourier–Bessel basis, respectively and $ {H}_i $ is the vector encoding the convolution operator of Section 2.4, that is, with components $ \hat{h_i}\left({\lambda}_{nk}\right) $ . The vectors are b-dimensional column vectors, where $ b=O\left({L}^2\right) $ is the number of basis coefficients. We use this simple structure to obtain a closed-form expression for the sample covariance matrix of the $ {F}_i $ . We estimate this matrix by minimizing the discrepancy between the sample covariance and the population covariance; more precisely, the estimated covariance matrix $ \tilde{C} $ is computed by solving the least squares-problem
where $ \tilde{\mu} $ is defined by
whose solution is
The least squares-solution of Equation (12) can be determined by the following system of linear equations
where
It follows that
As discussed in Section 2.5, the covariance matrix is block diagonal in the Fourier–Bessel basis. More precisely, the only nonzero elements $ \tilde{C}\left( nk,{n}^{\prime }{k}^{\prime}\right) $ of the matrix $ \tilde{C} $ are those with $ n={n}^{\prime } $ . Therefore, the matrices in Equations (16) and (17) need only be calculated for this subset of indices. Since there is a total of $ O\left({L}^3\right) $ of these indices, this reduces the computational complexity compared to computing with the full matrices.
Note that the covariance matrix estimated from (Equation 17) may not be positive semidefinite due to subtraction of the term $ {\sigma}^2\operatorname{diag}\left({H}_i^{\odot 2}\right) $ . Therefore, when running the method in practice it is beneficial to use an eigenvalue shrinkage method. The computational cost of eigenvalue shrinkage for a matrix with our block structure is $ O\left({L}^4\right) $ (Reference Bhamre, Zhang and Singer15, Reference Donoho, Gavish and Johnstone26). For completeness, we include this computational cost of eigenvalue shrinkage in our overall computational complexity. Informally speaking, the idea of eigenvalue shrinkage is to replace the term $ {\sum}_{i=1}^N\left[{B}_i\odot \left({H}_i{H}_i^T\right)-{\sigma}^2\operatorname{diag}\left({H}_i^{\odot 2}\right)\right] $ in Equation (17) by $ {\sum}_{i=1}^N\left[{B}_i\odot \left({H}_i{H}_i^T\right)\right] $ , and then shrink and truncate the eigenvalues in a systematic way before Hadamard division(Reference Donoho, Gavish and Johnstone26). The steps of the algorithm are summarized in Algorithm 1.
Algorithm 1 Fast covariance estimation method.
Input: Observed images $ {g}_i $ , radial functions $ {h}_i $ , $ i=1,\dots, N $ .
Output: Estimated covariance matrix $ \tilde{C} $ in the domain of the Fourier–Bessel basis
-
1. Expand observed images $ {g}_i $ into the Fourier–Bessel basis, with resulting coefficient vector $ {G}_i $
-
2. Compute the vectors $ {H}_i $ with components $ \hat{h_i}\left({\lambda}_{nk}\right) $ , representing the action of the CTFs in the Fourier–Bessel basis
-
3. Compute sample mean $ \tilde{\mu} $ from Equation (14)
-
4. Use Equation (16) to compute the elements $ {B}_i\left( nk,n{k}^{\prime}\right) $ for all $ n,k,{k}^{\prime } $
-
5. Use Equation (17) to compute the elements of the sample covariance matrix $ \tilde{C}\left( nk,n{k}^{\prime}\right) $ for all $ n,k,{k}^{\prime } $ , refined using eigenvalue shrinkage
The complexity of step 1 of the algorithm is $ O\left({NL}^2\log L\right) $ . The complexity of step 2 is $ O\left({ML}^2\log L\right) $ . The complexity of step 3 is $ O\left({NL}^2\right) $ , since the number of basis coefficients is $ O\left({L}^2\right) $ . The complexity of step 4 is $ O\left({NL}^3\right) $ , since there are at most $ O\left({L}^3\right) $ nonzero elements with indices $ \left( nk,n{k}^{\prime}\right) $ . The complexity of step 5 is $ O\left({NL}^3+{L}^4\right) $ , where the additional term $ \mathcal{O}\left({L}^4\right) $ comes from the computational complexity of eigenvalue shrinkage. Thus, the total complexity of the algorithm is $ O\left({NL}^3+{L}^4\right) $ .
We now show how this can be used in an application to image denoising. Given an estimate of the covariance matrix $ \tilde{C} $ , the CWF approach estimates the $ {F}_i $ by a linear Wiener filter(Reference Bhamre, Zhang and Singer15),
See Ref. Reference Bhamre, Zhang and Singer15 for more details.
3. Synthetic Data Results
We compare the timings of our fast method to previous approaches(Reference Bhamre, Zhang and Singer15), for synthetic images generated from the 3-D volume of SARS-CoV-2 (Omicron) spike complexes(Reference Guo, Gao and Li27) (EMD-32743), from the online EM data bank(Reference Lawson, Patwardhan and Baker28). The original volume has size $ 512 $ in each dimension, with pixel size 0.832 Å. We downsample the original volume to size $ L\times L\times L $ , with $ L=\mathrm{32,128} $ and 512, respectively, and show the computational times. To generate the synthetic noisy images, we first generate 10,000 clean projection images of the 3-D volume from random and uniformly distributed viewing directions. We next divide the set of clean images into a number of defocus groups, where the defocus values range from 1 μm to 4 μm. For all CTFs, we set the voltage as 300 kV and the spherical aberration as 2 mm. After convolving the images with their CTFs, we add colored noise with power spectral density $ 1/\left( rL/20+1\right) $ up to a constant scale, where $ r\in \left[0,1\right] $ is the radial frequency. For both the previous method and ours, the CTFs and the noisy images are whitened before estimating the covariance. A few sample clean and noisy images are shown in Figure 5. All experiments were carried out on a machine with 750 GB memory and 72 Intel Xeon E7–8880 v3 CPUs running at 2.30GHz. We note that our implementation heavily relies on packages such as FINUFFT(Reference Barnett29,Reference Barnett, Magland and Klinteberg30) , NumPy(Reference Harris, Millman and van der Walt31), and SciPy(Reference Virtanen, Gommers and Oliphant32), which are not fully optimized for parallel computing. The implementation of our fast method and the code of CWF(Reference Bhamre, Zhang and Singer15) both effectively use only around 20 cores at runtime. Due to the complexity of modern computational architectures, performing a fair comparison can be challenging, but we have made every effort to do so. Both our code and the CWF code(Reference Bhamre, Zhang and Singer15) take input images in a tensor format, and used vectorized operations whenever possible to avoid for-loops. The primary factor determining performance is not the specific implementation of the code, but rather the underlying computational complexities. Improving the parallelization is a technical direction for future work.
Figure 2 shows the time required to estimate the covariance matrices as a function of the number of defocus groups. Note that the runtime for the previous approach for the largest image and defocus group sizes is infeasibly large and that our fast method exhibits a speedup of up to three orders of magnitude.
Figure 3 shows the top six principal components estimated by our method and by traditional PCA using $ {10}^4 $ raw images, where $ L=128 $ , $ \mathrm{SNR}=0.1 $ and $ M=100 $ defocus groups, compared to traditional PCA on $ {10}^6 $ images. We use the sample covariance matrix of phase-flipped images in real space for the traditional PCA. For all methods, we use $ \lambda $ to denote $ 100 $ times the eigenvalues of the eigenimages. The eigenimages from the traditional PCA look much noisier than ours, and contain artifacts that are due to imperfect CTF correction (see, e.g., the circular artifacts in top three eigenimages in Figure 3). The eigenimages from the traditional PCA also fail to preserve the symmetries (see, e.g., 6th eigenimage in Figure 3) that are present in our eigenimages, since they do not utilize the steerable basis and rotation-augmented images.
Figure 4 shows the quality of covariance estimation and image denoising when $ L=128 $ and using $ M=100 $ defocus groups. The quality of the covariance estimation is measured by the relative error in each angular frequency, which is defined as
where $ {C}_n $ and $ {\tilde{C}}_n $ are respectively the $ n $ th diagonal blocks of the clean and estimated covariance matrix, corresponding to all indices of the form $ \left( nk,n{k}^{\prime}\right) $ . The clean covariance matrix was approximated by the sample covariance matrix of $ {10}^6 $ clean projection images. The performance of image denoising is measured by the Fourier ring correlation (FRC) between the clean and denoised images. Namely, for the $ i $ th pair of clean and denoised images $ {I}_i^c $ and $ {I}_i^d $ , we first compute their Fourier coefficient vectors $ {\boldsymbol{f}}_{i,r}^c,\hskip0.3em {\boldsymbol{f}}_{i,r}^d $ at radial frequency $ r $ by the nonuniform FFT(Reference Dutt and Rokhlin33–Reference Lee and Greengard35), where $ 1\le i\le N $ and $ 0\le r\le {r}_{\mathrm{max}} $ . We then compute their averaged correlation for each $ r $ :
where $ \left\langle \cdot, \cdot \right\rangle $ denotes the inner product between two complex vectors. The FRC is a real-valued quantity due to a symmetry property that arises since the images $ {I}_i^c,{I}_i^d $ are real-valued.
In addition to the speedups of Figure 2, the left panel of Figure 4 demonstrates a slight increase in the estimation quality of our proposed algorithm, compared to the previous approach. This is possibly caused by improved accuracy in the Fourier–Bessel basis approximation as well as improved accuracy by using the closed-form expression Equation (17) compared to the approximate conjugate gradient step of previous approaches. Similarly, on the right panel, the Fourier ring correlation between the denoised and clean images shows a slight performance increase. Figure 5 shows sample denoised images for different values of the SNR where $ L=128 $ and $ M=100 $ . As a comparison, we show images denoised using the approach of this paper and the CWF method(Reference Bhamre, Zhang and Singer15).
4. Experimental Data Results
We conclude by using our method on two experimental datasets obtained from the Electron Microscopy Public Image Archive(Reference Iudin, Korir, Salavert-Torres, Kleywegt and Patwardhan19), namely EMPIAR-10028(Reference Wong, Bai and Brown18) and EMPIAR-10081(Reference Rangan, Spivak, Andén and Barnett36). EMPIAR-10028 is a dataset of the Plasmodium falciparum 80S ribosome bound to the antiprotozoan drug emetine whose 3-D reconstruction is available in the EM data bank as EMD-2660(Reference Wong, Bai and Brown18). The dataset contains 105247 motion corrected and picked particle images, from 1081 defocus groups, of size $ 360\times 360 $ with 1.34 Å pixel size. EMPIAR-10081 is a dataset of the human HCN1 hyperpolarization-activated cyclic nucleotide-gated ion channel, whose 3-D reconstruction can be found in the EM data bank as EMD-8511(Reference Lee and MacKinnon36). The dataset contains 55,870 motion corrected and picked particle images, from 53,384 defocus groups, of size $ 256\times 256 $ with 1.3 Å pixel size.
Computational times are shown in Table 2, showing a speedup of more than two orders of magnitude for the datasets with the largest number of distinct CTFs. Note that regular CWF on EMPIAR-10081 encounters memory issues and cannot be run to completion, whereas our fast method runs seamlessly. The old CWF has much higher space complexity $ O\left({ML}^3\right) $ for covariance estimation than that of our method $ O\left({L}^3\right) $ . The reason is that the old CWF represents CTFs using block diagonal matrices (each takes $ O\left({L}^3\right) $ memory). Moreover, the old CWF loads all these block diagonal CTFs at once in memory so as to define the linear operator for the conjugate gradient method. Our method circumvents the need to store all CTFs at once due to our closed-form solution, and therefore we can read CTFs in batches. Moreover, each CTF has diagonal representation in our method, which means much lower space complexity $ O\left({L}^2\right) $ . The $ O\left({L}^3\right) $ space complexity for our method is only for storing a single covariance matrix.
Note. For EMPIAR-10081, when $ L=256 $ , the old CWF method(Reference Bhamre, Zhang and Singer15) encounters memory issues and cannot be run until completion. $ {T}_{\mathrm{ffb}} $ , the time required to expand all images in the Fourier–Bessel basis (step 1 in Algorithm 1); $ {T}_{\mathrm{ctf}} $ , the time to compute a matrix representation of the application of the point spread function (step 2 in Algorithm 1); $ {T}_{\mathrm{cov}} $ , the time to estimate the covariance matrix (steps 3–5 in Algorithm 1); $ {T}_{\mathrm{denoise}} $ , the time to denoise the number of images indicated in the main text (2014 images for EMPIAR-10028 and 502 images for EMPIAR-10081); $ {T}_{\mathrm{total}} $ , the total computational time.
For EMPIAR-10081, storing the CTFs for the old method takes approximately $ 50,000\times \left({256}^3/6\right)\times 8/{10}^9\approx 1118 $ GB which is above the 750 GB memory limit of our machine. For the new method, storing the covariance matrix takes $ \left({256}^3/6\right)\times 8/{10}^9\approx 0.02 $ GB and storing the CTFs takes $ 1000\times {256}^2\times 8/{10}^9\approx 0.52 $ GB (with batch size 1000) that can be even handled by a common laptop.
In order to obtain a comparison, we therefore additionally downsample these images to $ L=128 $ where the original CWF can successfully run. On EMPIAR-10028, we used all images for covariance estimation, and denoised the 2014 images from $ 0 $ th, $ 50 $ th, $ 100 $ th, …, $ 1000 $ th defocus groups. On EMPIAR-10081, we used all images for covariance estimation, and denoised the 502 images from $ 0 $ th, $ 100 $ th, $ 200 $ th, …, $ 50,000 $ th defocus groups. For both old and new methods, the covariance matrices are further refined to correct image contrast variations(Reference Shi and Singer21). Sample visualization results are shown in Figures 6 and 7. The old CWF method cannot handle the full resolution recovery for EMPIAR-10081 (with about 50,000 distinct CTFs) due to its high space complexity. Moreover, the old CWF method does not return satisfying restored images for downsampled low resolution images. The reason is that for EMPIAR-10081, almost all images belong to different defocus groups, and the estimated noise power spectrum from a single image is too noisy. Therefore, EMPIAR-10081 is a failure case for the old CWF method in terms of both accuracy and computation time. In contrast, our method assumes radially symmetric noise power spectrum and estimates it using NUFFT over concentric rings with different radii, and averages over each ring. This averaging process largely reduces the noise in the estimated power spectrum.
5. Discussion
Covariance estimation and PCA of cryo-EM images are key ingredients in many classic cryo-EM methods including multivariate statistical analysis(Reference Van Heel3–Reference van Heel, Portugal and Schatz6) and Kam’s method for ab initio modeling(Reference Kam8). We propose a fast method to estimate the covariance matrix of noisy cryo-EM images, and then illustrate its application to simultaneously correct for the CTFs and denoise the images. The approach relies on recent improvements to algorithms for expanding images in the Fourier–Bessel basis(Reference Marshall, Mickelin and Singer16), and has time complexity $ O\left({NL}^3+{L}^4\right) $ which is independent of the number of defocus groups. Our new approach is both significantly faster and more memory-efficient compared to the previous CWF method(Reference Bhamre, Zhang and Singer15) and we apply our method to large experimental datasets with many distinct CTFs with speedups by factors up to more than two orders of magnitude. Our approach could potentially be extended to higher-dimensional data and to the setting where images are distorted by CTFs which are not exactly radial, using either analytical correction terms or iterative numerical steps.
Acknowledgments
We are grateful to Kenny Huang for running some numerical experiments in the initial stage of this project.
Competing Interests
The authors declare no competing interests exist.
Authorship Contributions
N.F.M., O.M., Y.S., and A.S. conceived the project. N.F.M., O.M., and Y.S. designed the algorithms. Y.S. wrote the software and performed the experiments. All authors wrote the article and approved the final submission.
Funding Statement
N.F.M. was supported in part by NSF DMS-1903015 and was visiting the Institute for Pure and Applied Mathematics (IPAM) when part of this research was performed, which was supported by NSF DMS-1925919. A.S. was supported by grants from the AFOSR FA9550-20-1-0266; the Simons Foundation Math+X Investigator Award; NSF BIGDATA Award IIS-1837992; NSF DMS-2009753; and NIH/NIGMS 1R01GM136780–01.
Data Availability Statement
Replication data and code can be found at https://github.com/yunpeng-shi/fast-cryoEM-PCA.