1. Introduction
Viscous flow of ice in glaciers and ice sheets is governed by gravitational driving forces and resisting tractions at ice-rock boundaries, as well as internal stresses resulting from stretching and compression. For laterally confined ice shelves that flow within embayments, flow is resisted by shear stresses at the margins where faster-flowing ice is in contact with rock or immobile ice. Basal shear stresses can further resist flow where ice is locally grounded at ice rises or pinning points. The total resistance, or buttressing, provided by ice shelves to upstream grounded ice is a key modulator for potential changes in flow speed of the grounded ice to changes in atmospheric or oceanic conditions. However, accurate quantification of buttressing stresses and modeling of ice shelf flow depends on well-calibrated estimates of ice rheological parameters throughout the modeling domain.
Observations of ice flow, whether in an experimental or natural setting, are the only means by which we can infer mechanical properties such as ice rheology. Specifically, spatially continuous flow velocity measurements permit robust strain rate estimates, which can have considerable spatial variability due to differing flow regimes, ice rheology, ice geometry and other factors. These strain rates are linked to stresses within the ice through an appropriate constitutive relation, where the most commonly used relation is Glen's Flow Law:
where τ ij is the deviatoric stress tensor, η is the effective dynamic viscosity, B is the ice rigidity, n is the stress exponent, $\dot {\epsilon }_{ij}$ is the strain rate tensor and $\dot {\epsilon }_{\rm e} = \sqrt {\dot {\epsilon }_{ij}\dot {\epsilon }_{ij}/2}$ (where we apply the summation convention for repeated indices) is the effective strain rate computed as the square root of the second invariant of the strain rate tensor (Glen, Reference Glen1958). Note that a prefactor defined as A = B −n is also commonly used in Glen's Flow Law. All of the terms in Eqn (1) vary spatially with different intrinsic lengthscales. The stress exponent, n, is set by the dominant mechanisms of creep that drive the deformation of ice and is dependent on the stress regime, grain size, ice temperature and crystallographic fabric (Goldsby and Kohlstedt, Reference Goldsby and Kohlstedt2001). The prefactor, B, which we refer to as the ice rigidity, shares the same dependencies as the exponent, in addition to interstitial water content, impurities and damage (Cuffey and Paterson, Reference Cuffey and Paterson2010). Thus, both B and n are lumped parameters in Glen's Flow Law that represent a combination of factors and mechanisms which generally cannot be observed continuously at the scale of ice shelves and ice sheets. Rather, B and n must be inferred from ice surface velocity and thickness observations for each area of interest.
To construct a tractable inverse problem, Glen's Flow Law is first injected into an appropriate dynamical framework (i.e. governing equations for ice flow) to obtain a non-linear mapping from parameters (B and n) to observables (ice velocity) over the entire modeling domain. This mapping, or forward problem, can be used in an optimization framework to estimate parametric values that optimally reconstruct the surface observations (MacAyeal, Reference MacAyeal1989, Reference MacAyeal1993). The outcome of the inverse problem, a 2D map of B and n, can then be used in Glen's Flow Law to compute stresses within the ice, which allows for further prognostic simulations to project the evolution of ice flow for a given study area in response to changing climatic conditions. However, for static datasets, i.e. snapshots of velocity and thickness at a given time epoch, B and n cannot be uniquely determined, and independent constraints on one of the parameters is required to reduce the non-uniqueness. In this work, we focus only on inference of a spatially varying rigidity B, noting that recent work has demonstrated that n may be estimated in Greenland and Antarctica independently under certain flow conditions (e.g. Bons and others, Reference Bons2018; Millstein and others, Reference Millstein, Minchew and Pegler2022), leading to a value of n ≈ 4 which is consistent with experimental analysis of ice deformation under realistic pressure environments and strain rates (Qi and Goldsby, Reference Qi and Goldsby2021).
Still, estimation of the optimal rigidity field is equivalent to drawing only a single sample of B from the total statistical distribution of fields that could explain the observations nearly as well as the optimal one. This distribution is influenced by observational uncertainties as well as modeling uncertainties. For the latter, modeling uncertainties can stem from factors such as model resolution (sensitivity of the forward model to variations in parameter values) and model misspecification where the model fails to capture relevant physics or makes improper assumptions about certain aspects of the physics. Overall, quantification of the distribution of parameter values is of equal importance to estimating the optimal values, and it is ultimately necessary for obtaining a realistic distribution of future ice states conditioned on current-day observations (Aschwanden and others, Reference Aschwanden, Bartholomaus, Brinkerhoff and Truffer2021).
In this work, we aim to develop a framework for estimating the distribution of ice rigidity for large study areas that combines information extracted from relevant surface observations with information obtained from prior theories, experimental/observational studies, etc. While such a framework has a long history in Bayesian inference, our primary consideration in this work is a matter of scalability to large datasets as well as to a large number of effective model parameters. To that end, we build upon recent developments in variational inference and physics-informed machine learning to address the problem of scalability. We use a combination of neural networks for modeling continuous surface observations with variational Gaussian Processes for modeling ice rigidity probability distributions. The mapping between surface observations and rigidity is provided by partial differential equations (PDEs) describing ice flow, which ultimately allow us to include a physics-informed loss function to the training objective for the machine learning models. Both classes of models allow for training with stochastic gradient descent, which is critical for scaling the inference method to large datasets. We target select ice shelves in Antarctica for demonstrating the proposed methods as they provide a number of favorable modeling simplifications while maintaining adequate complexity and large spatial extents suitable for examining the advantages and disadvantages of the proposed methods.
2. Methodology
In this section, we will introduce the governing equations for ice flow that link spatial variations in our parameter of interest, ice rigidity, to observations of ice shelf velocity and thickness. We then introduce a probabilistic physics-informed machine learning framework designed to produce a probability distribution of rigidity fields consistent with the surface observations. We discuss how we utilize variational Bayesian techniques to perform inference at the scale of large ice shelves, observed with large datasets.
2.1. Ice flow momentum balance forward model
Given a spatial domain with spatial coordinates specified by x, where for two dimensions x = [x, y], our goal is to estimate the most likely spatial field of ice rigidity, B = B(x), conditional on observations of the flow of ice shelves and their geometry. To that end, we utilize a momentum balance method to estimate B that computes resistive stresses that optimally balance gravitational driving stresses. Within ice shelves, resistive (vertical) shear stresses at the base are negligible due to contact with relatively inviscid seawater. Ice is a thin film with small thicknesses relative to the horizontal dimensions (aerial extent). Thus, we are justified in employing the widely used shallow-shelf approximation (SSA), which assumes negligible vertical shearing in a thin film and vertically integrates viscosity and stresses in the ice column to obtain a simplified 2D framework for the governing equations of flow in ice shelves. We write the SSA momentum balance as:
where u and v are the horizontal velocity components of the velocity vector, u, along the x- and y-directions, respectively, and taken to be constant with depth; h is the ice thickness; s is the ice surface elevation; $2\eta = B \dot {\epsilon }_{\rm e}^{{1-n}/{n}}$ is the effective dynamic viscosity of ice (Eqn 1); ρ i is the mass density of ice (assumed to be constant at ρ i = 917 kg/m3); and g is the gravitational acceleration. In the above formulation, r x and r y on the left-hand side represent residual terms in the momentum balance. For the general SSA as applied to flowing ice where ice is grounded, such as at ice streams, these residual terms are non-zero and correspond to basal drag. However, since drag at the base of ice shelves is assumed to be negligible because of the negligible viscosity of seawater, r x and r y are nominally zero. Thus, we seek to construct the field B(x) that provides an optimal balance between the membrane stresses (first two terms on the right-hand side) and the driving stress (last term on the right-hand side) such that r x and r y are close to zero.
2.2. Reparameterization for ice rigidity
Before proceeding, we first introduce a commonly used reparameterization of the rigidity B:
where B 0 (x) represents some reference field of the rigidity and θ(x) corresponds to logarithmic rigidity variations about B 0 (x). Since B is a strictly positive quantity, this reparameterization allows for the transformation of an inequality-constrained inference problem on B to an unconstrained inference problem on θ while also compressing the dynamic range of rigidity variations to log-space (Shapero and others, Reference Shapero, Badgeley, Hoffman and Joughin2021; Brinkerhoff, Reference Brinkerhoff2022). Analysis of inferred variations in B can thus be performed by inserting any inferred variations in θ into the above equation. Proper choice of the reference rigidity B 0 depends on the prior information available, which we discuss in detail in later sections.
2.3. Probabilistic inference of ice rigidity from remote-sensing data
Observations of horizontal ice velocity over ice-sheet margins have been widely available for the past decade thanks to the prevalence of remote-sensing platforms and efficient data processing methodologies (Joughin and others, Reference Joughin, Smith, Howat, Scambos and Moon2010; Mouginot and others, Reference Mouginot, Rignot, Scheuchl and Millan2017; Gardner and others, Reference Gardner, Fahnestock and Scambos2019). At the same time, improved integration of ice-penetrating radar and surface velocities using mass conservation techniques have allowed for more accurate and higher resolution maps of ice thickness and bathymetry (Morlighem and others, Reference Morlighem2017). Specifically, over ice shelves, it is common practice to convert observations of surface elevation, which are well constrained, to ice thickness by assuming hydrostatic equilibrium and applying corrections for firn layers derived from in situ thickness data (Morlighem and others, Reference Morlighem2020). Thus, we can estimate spatial gradients and compute the SSA momentum balance directly for a given rigidity field when velocity and thickness observations are spatially continuous over an ice shelf. However, observation noise and data gaps generally degrade estimates of observation gradients, resulting in non-physical estimates of SSA terms that require an additional gradient operation. Therefore, we seek to formulate a method for approximating the continuous functions underlying the surface velocity, ice thickness and ice rigidity fields that balance the reconstruction accuracy of the observed data while resulting in minimal SSA residuals, r. Furthermore, such a method should allow for rigorous quantification of the uncertainties associated with θ, which will be driven by observation noise, varying sensitivities of the SSA equations to spatial variables, and prior knowledge on the range of values for θ.
To that end, we utilize Bayes’ Theorem to construct the joint posterior probability distribution for rigidity and the reconstructed surface velocity and ice thickness, conditioned on a set of velocity and thickness observations assimilated into the SSA momentum balance. Let us first define the observation vector d = [u, v, h], reconstructed observation vector $\hat {{\bf d}} = [ \hat {u},\; \, \hat {v},\; \, \hat {h}]$, and SSA residual vector r = [r x, r y]. In the following, we consider that all three vectors vary spatially over an ice shelf, i.e. d = d(x), $\hat {{\bf d}} = \hat {{\bf d}}( {\bf x})$ and r = r(x). We can then write the joint posterior distribution for θ and $\hat {{\bf d}}$ as (Appendix A):
The first two terms on the right-hand side of the equation are the data likelihood distributions, where $p( {\bf d} \vert \hat {{\bf d}})$ measures the probability of observing d for a given set of predictions $\hat {{\bf d}}$ while $p( {\bf r} \vert \theta ,\; \, \hat {{\bf d}})$ measures the probability of computing r (which is nominally [0, 0] for ice shelves) through evaluation of the SSA momentum balance using $\hat {{\bf d}}$ and θ. The last term on the right is the prior distribution p(θ), which encodes our prior knowledge on θ values without having seen any observations.
2.4. Variational inference of ice rigidity
Due to non-linearities in the SSA momentum balance and potentially non-Gaussian data likelihoods, the posterior for θ and $\hat {{\bf d}}$ must be approximated, either by drawing random samples from $p( \theta ,\; \, \hat {{\bf d}} \vert {\bf d},\; \, {\bf r})$ or by constructing a suitable approximating distribution. The former strategy is based on the general class of Markov Chain Monte Carlo (MCMC) approaches and tends to be suitable for a low or moderate number of model dimensions. However, for spatial domains with sizes typical of Antarctic ice shelves, the number of model dimensions will be quite high, regardless of the model used to represent the spatial fields. Therefore, we instead employ a variational inference framework wherein we aim to construct an approximating distribution, $q( \theta ,\; \, \hat {{\bf d}})$, for θ and $\hat {{\bf d}}$ that is minimally divergent from the true posterior $p( \theta ,\; \, \hat {{\bf d}} \vert {\bf d},\; \, {\bf r})$. The choice of approximating distribution generally involves a trade-off between computational complexity (e.g. number of trainable parameters) and accuracy in capturing relevant statistics and correlations in the target posterior (Blei and others, Reference Blei, Kucukelbir and McAuliffe2017). We discuss the choice of approximating distribution in the next section.
At this stage, let us also enforce that $q( \theta ,\; \, \hat {{\bf d}})$ is the product of two independent variational distributions, $q( \theta ,\; \, \hat {{\bf d}}) = q( \theta ) q( \hat {{\bf d}})$. This independence will allow us to use separate machine learning models to reconstruct observations and predict rigidity while using a joint objective function to tune their parameters simultaneously. Moreover, we restrict $q( \hat {{\bf d}})$ to produce only the mean reconstructed velocity and thickness, assuming that information about formal observation uncertainties can be obtained. Later, we will discuss how this assumption on $q( \hat {{\bf d}})$ will affect the posterior variance estimates for θ.
As is commonly done in variational inference, we utilize the Kullback–Leibler (KL) divergence for measuring the difference between two probability distributions:
By minimizing the KL-divergence, we are tuning the variational distribution to be close to the target posterior distribution from an informational perspective. However, the existence of the posterior distribution in the denominator of the log term generally makes computing the KL-divergence intractable. As shown in Appendix A, we can instead maximize a stochastic variational lower bound, referred to as the Evidence Lower Bound (ELBO):
The first term for the ELBO is the expected value of the log-likelihood for the SSA residual vector, r, for a given set of predictions $\hat {{\bf d}}$ and integrated over the variational distribution for θ (see Appendix A for estimation of this integral over ice shelves). The second t erm for the ELBO is the data log-likelihood for the observations d given predictions $\hat {{\bf d}}$. The last term is the KL-divergence between the variational distribution q(θ) and prior p(θ) for the rigidity, which can be computed analytically for certain pairs of distributions, e.g. two multivariate normal distributions. Overall, the ELBO provides an objective function for learning an approximating distribution for the spatial field of rigidity, as well as the mean ice velocity and thickness fields. In the next section, we discuss the choice of approximating distribution q(θ) and the parameterization of spatial fields such that the ELBO can be computed efficiently for large datasets and spatial domains.
2.5. Parameterization of spatial fields and physics-informed machine learning
We seek representations of the mean spatial fields $\hat {{\bf d}}( {\bf x})$ and an approximating distribution q(θ) that permit efficient evaluation of the ELBO in Eqn (7). For ice-sheet modeling in general, spatial fields are typically discretized into an irregular mesh of triangular finite elements. However, this approach can lead to a large number of nodal parameters for a given spatial field, particularly for fields with high spatial variability with length scales on the order of a few ice thicknesses. Large numbers of nodes can severely restrict the choice of approximating distributions that are computationally feasible. For example, for a mesh with N nodes, an approximating multivariate normal distribution with a mean vector of N elements and a covariance matrix of N 2 elements would require excessive computer memory when N is greater than a few thousand nodes.
Our strategy in this work is to adopt mesh-free parameterizations of spatial fields using two different machine learning models: (1) a neural network for the mean fields $\hat {{\bf d}}( {\bf x})$; and (2) a variational Gaussian Process (VGP) for the approximating distribution for the rigidity field θ(x). Both models fall under the broad classification of function approximators for a given set of data, while the VGP (and Gaussian Processes in general) also approximates the probability distribution of functions that generate the given dataset. For the mean fields $\hat {{\bf d}}( {\bf x})$, a dense, feedforward neural network, f ψ, is used to represent the surface observations on a point-by-point basis:
where $\hat {{\bf d}}_i$ is the vector of neural network predictions at the i-th coordinate xi, and ψ represents the total set of weights and biases of the hidden layers. The choice of feedforward networks in this work is motivated by the ability to generate predictions of $\hat {{\bf d}}_i$ at arbitrary coordinates, which facilitates mini-batch training by stochastic gradient descent. For the ice shelf models investigated here, we found that four-layer feedforward networks with 50–100 nodes per layer provide a suitable balance between expressivity and computational efficiency. Additionally, we use tanh activation functions as they are continuously differentiable and result in spatial gradients that are spatially smooth (Riel and others, Reference Riel, Minchew and Bischoff2021).
For the rigidity variational distribution, we exploit the ability of Gaussian Processes (GPs) to construct an approximating multivariate normal distribution through computation of a mean vector for any set of coordinates and a covariance matrix for any pairwise set of coordinates (Rasmussen, Reference Rasmussen2003). Multivariate normal distributions permit modeling of correlations between variables and have been shown to agree well with MCMC-sampled posteriors for ice dynamics inverse problems (Petra and others, Reference Petra, Martin, Stadler and Ghattas2014). The mean and covariance both require the evaluation of a kernel function that describes the expected similarity between data points for a given coordinate pair. Here, we use a squared exponential kernel:
where σ 2 is the kernel variance and L is a covariance length scale that describes the characteristic distance up to which function values at x and x′ are correlated. However, as described in Appendix B, the use of standard GPs to infer an approximating distribution for a given dataset will be computationally prohibitive for large datasets, which stems from the need to evaluate the kernel at N coordinates in order to construct and invert a covariance matrix of size N × N. For our study areas, N is usually on the order of tens of thousands of points. Variational GPs are a modification of standard GPs that utilize the concept of sparse inducing index points for overcoming the prohibitive ${\cal O}( N^2)$ memory and ${\cal O}( N^3)$ time requirements of large sets of index points for standard GPs (Titsias, Reference Titsias2009). Rather than evaluating the kernel at N coordinates, we evaluate the kernel at M inducing index coordinates, z, where M ≪ N. The inducing index points are sometimes referred to as support points and act as a low-dimensional latent representation of the data (Quinonero-Candela and Rasmussen, Reference Quinonero-Candela and Rasmussen2005). As explained in Appendix B, computation of a posterior mean and covariance matrix at an arbitrary number of prediction points involves evaluation of the kernel at the prediction and inducing index points and combining the kernel evaluations with a full-rank covariance matrix at the inducing points. The number of inducing points, M, is a prescribed parameter that provides additional control on the complexity of the predicted variable field. For the ice shelves studied here, we find that M between 300 and 1000 provides sufficient expressiveness for the rigidity fields. Overall, the total set of trainable variables, ϕ, for the VGP consists of the inducing point locations z, the inducing point mean values, the covariance matrix values at the inducing points and the hyperparameters of the kernel (σ 2 and L). In the following, we denote the approximating multivariate normal distribution as q ϕ(θ).
While GPs and VGPs are usually trained in a supervised fashion for a training dataset of N examples, we do not have a set of ground truth rigidity values for θ. We instead insert q ϕ(θ) into the ELBO in Eqn ( 7) in order to train f ψ and q ϕ(θ) simultaneously. This style of training is consistent with recent developments in physics-informed machine learning where available physics knowledge (in our case, the SSA momentum balance) is used to augment observables with pseudo-observables, which can be useful in domains where direct observations are impossible or difficult to obtain (Raissi and others, Reference Raissi, Perdikaris and Karniadakis2019; Riel and others, Reference Riel, Minchew and Bischoff2021). Here, ice surface velocity and thickness are observable, and their reconstructed values (generated by f ψ) are used in conjunction with rigidity samples drawn from q ϕ(θ) in order to compute the SSA residual pseudo-observables, which should nominally be zero. We evaluate the SSA residuals at a random set of C ‘collocation’ coordinates, xc, not included in the training data for f ψ in order to maximize the information extracted from the pseudo-observables (Raissi and others, Reference Raissi, Perdikaris and Karniadakis2019).
The ELBO in Eqn (7) can now be written in terms of the neural network parameters ψ and the VGP parameters ϕ:
where we use the shorthand notation θ(xc) = θ c and θ(z) = θ z to indicate the rigidity field evaluated at the collocation points xc and the inducing index points z, respectively. The neural network f ψ is evaluated at the surface observation coordinates x. In this work, we minimize the negative of the ELBO using the Adam optimizer (Kingma and Ba, Reference Kingma and Ba2014) with a learning rate between 1e−4 and 5e−4 and batch sizes varying from 256 to 1024 examples. Additionally, we often found it beneficial to pre-train the network f ψ using only the middle term of the ELBO (i.e. fitting the observations without the rigidity field). The pre-training results in a f ψ that can then be fine-tuned with the full ELBO in order to improve overall training convergence with the VGP. Typically, we make a qualitative assessment of convergence by monitoring the loss values for training and testing datasets (using a train-test split of 85 and 15% of the data, respectively) and stop training when reductions in the loss value are negligible (Fig. S1). A summary of the neural network and variational inference training framework is described in Figure 1.
2.6. Selection of data likelihood and prior distributions
To concretize the ELBO training objective for the neural network f ψ and VGP q ϕ(θ), we now discuss specification of the data likelihoods for the surface observations and SSA residuals and the prior distribution for the rigidity.
2.6.1. Data likelihood specification
For the likelihood $p( {\bf d} \vert \hat {{\bf d}})$, we use independent normal distributions for the velocity components and ice thickness:
where the different σ 2 variables correspond to the variances of each observable. The observation variances may be prescribed or learned, and in this work, we prescribe the variances to be scaled values of formal observation uncertainties provided with the datasets. We found that a scaling factor between 10 and 30 prevented the neural network f ψ from overfitting the velocity data in order to generate velocities that are more consistent with the SSA momentum balance (e.g. mitigating high spatial frequency velocity variations that are not well-modeled by the SSA approximation). We further note that we apply a limited amount of spatial smoothing to the velocity and thickness data (using a Gaussian filter with a window size of roughly half of the mean ice thickness of the shelf) in order to mitigate high-frequency observation noise and improve training convergence.
For the likelihood $p( {\bf r} \vert \theta ,\; \, \hat {{\bf d}})$, we again use a normal distribution:
where the mean of zero encodes that we expect ${\bf r} = {\bf r}( \theta ,\; \, \hat {{\bf d}})$ to be zero on ice shelves, and $\sigma _{{\bf r}}^2$ is the expected variance of the residuals. Prescribing a proper value of $\sigma _{{\bf r}}^2$ for the likelihood distribution involves a careful consideration of the observation uncertainties and the expected uncertainties in θ. In the general case, one could perform a Monte Carlo estimate of $\sigma _{{\bf r}}^2$ by inputting random realizations of the velocity data and samples of θ from the prior into the SSA equations to get an expected range of values for r. Using this approach, we find that σ r ≈ 1 kPa works well for most cases and can be decreased for lower observation noise. Note that this approach does not explicitly consider situations where the SSA (with zero basal drag for ice shelves) is expected to perform poorly, such as pinning points where ice shelves become locally grounded over bathymetric highs. These epistemic uncertainties should correspond to an increase in $\sigma _{{\bf r}}^2$. An additional improvement to the methods presented here is to use a full variational distribution for $q( \hat {{\bf d}})$, rather than using only the mean values $\hat {{\bf d}}$. Our restriction to the mean values for $\hat {{\bf d}}$ likely reduces the estimated posterior variance for θ, which can be partially compensated by inflating $\sigma _{{\bf r}}^2$. Overall, the likelihood formulation can be improved by modeling the full probability distribution for the surface observations, explicit handling of epistemic uncertainties, incorporation of spatially correlated uncertainties (e.g. Brinkerhoff, Reference Brinkerhoff2022) and using a non-Gaussian probability distribution. We leave these improvements for future exploration.
2.6.2. Prior distribution
As discussed earlier, the VGP uses a squared exponential kernel function for posterior inference of rigidity (Eqn 9). Likewise, we use the squared exponential kernel to construct the GP prior p(θ) in order to encourage spatial coherence between predictions of θ at different coordinates:
where the prior variance and length scale, $\sigma _\theta ^2$ and $L_\theta$, are prescribed and not tuned during minimization of the negative ELBO. For ice dynamics well-described by the SSA approximation, L θ can usually be chosen to be some finite multiple of the mean ice shelf thickness, which is commensurate with the longitudinal stress coupling length scale (Cuffey and Paterson, Reference Cuffey and Paterson2010). The complete prior distribution is written as:
Since we enforce a prior mean of zero, the prior variance $\sigma _\theta ^2$ will depend on the value of B 0 used for the reparameterization of rigidity. In this work, we investigate two different strategies: (i) assume that B 0 is uniform; or (ii) estimating a spatial field B 0(x) independently through an inversion using traditional control methods. For the second strategy, recall that traditional control methods generally use an optimization objective based on the misfit between observed and predicted ice velocities, which is different from the momentum balance optimization objective used here. Therefore, the velocity misfit-based objective will implicitly have different spatially varying sensitives to the parameter field than the momentum balance objective, which provides an opportunity to combine the two objectives in a complementary manner. For the two different strategies for selecting B 0, we also correspondingly adjust the prior variance. For a uniform B 0, we set $\sigma ^2_\theta = 1$ to allow for a relatively large variation in θ over the ice shelf. For a B 0 obtained from a control method inversion, we reduce $\sigma _\theta ^2$ to 0.2, which encodes our belief that the values of B from the inversion are relatively well-constrained, and θ thus represents smaller deviations of B dictated by the momentum balance optimization objective.
2.7. Generating shelf-wide samples of the ice rigidity
While the VGP utilizes inducing index points to allow for per-batch prediction of θ and the corresponding posterior covariance matrices, we require an algorithm for generating a random sample of θ with a given spatial resolution over the entire modeling domain. Even with the use of inducing points, assembling global posterior mean and covariance matrices for the entire domain would require a very large amount of memory for uniform grids with sizes exceeding tens of thousands of grid points (${\cal O}( NM)$ memory requirements for N grid points and M inducing points). We therefore utilize an MCMC-based approach where a random sample of θ is generated at a limited set of coordinates within the ice shelf and used as a seed to grow a full chain over the entire shelf, which is equivalent to block-sampling approaches for sampling Gaussian Markov Random Fields (e.g. Rue, Reference Rue2001). Note that this form of MCMC utilizes the posterior statistics learned from variational inference and does not involve evaluation of a physics-based forward model. Specifically, we apply Gibbs sampling on a block-by-block basis where a block is defined as a small subset of the uniform grid. For each block, the mean ${\boldsymbol \mu }_\theta$ and covariance matrix ${\bf \Sigma }_\theta$ are computed using the trained VGP. In this way, we can grow a random chain through successive evaluation of smaller multivariate normal distributions rather using a multivariate normal with a very large global covariance matrix. As an example, assume that the first two blocks are combined in the following partition:
Then, initialization of a Gibbs chain would use the following identities:
In words, we use the trained VGP to generate a mean vector and covariance matrix for two blocks of coordinates, which are then partitioned according to Eqn (15). A sample θ 1 = θ(x1) is drawn directly from the mean and covariance of the first block. Then, a sample θ 2 = θ(x2) is drawn from a multivariate normal with a mean and covariance that are computed using the Schur complement. We then set μ1 ← μ2, θ1 ← θ2, Σ11 ← Σ22, and use the VGP to predict the mean and covariance for the next set of blocks. The process is repeated for all blocks in the modeling domain. With this approach, each sample is independent of the previous sample and is accepted with a probability of 1. In our experiments, we found that if the block size was chosen to be too small, the variance of the final sample was artificially large, likely due to excessive truncation of the covariance matrix (depending on the resolution of the uniform grid relative to the length scale of the posterior variations in rigidity). Here, we found a block size of 1000 grid points works well for grids with cell sizes equal to roughly half or a quarter of the prior covariance lengthscale. Overall, we can validate the samples derived from the Gibbs sampler by viewing traceplots (plots of samples at independent coordinates) and cross-comparing the standard deviation fields between the sample standard deviation and the standard deviation predicted directly by the VGP (Fig. S6).
2.8. Related work
Bayesian inference has long been applied to geophysical inverse problems, and as computational resources and inference algorithms improve, the complexity and size of the physical models investigated has increased. Within glaciological inverse problems, Bayesian formulations of the posterior distributions have been used as cost functions for obtaining point estimates of basal topography and friction for grounded ice streams (Pralong and Gudmundsson, Reference Pralong and Gudmundsson2011). For fully Bayesian inference, Petra and others (Reference Petra, Martin, Stadler and Ghattas2014) developed an MCMC method for estimating the posterior distribution for ice-sheet models with a large number of parameters, utilizing low-rank approximations of data likelihood Hessian matrices in order to reduce computational complexity while improving sample efficiency. Similarly, Gopalan and others (Reference Gopalan, Hrafnkelsson, Aḥhalgeirsdóttir and Pálsson2021) used a Gibbs sampler in order to sample for ice stream model parameters for a simpler model applicable to slower-flowing ice. While MCMC methods generally serve as ‘gold standards’ for Bayesian inference, they do not scale well to large problem sizes. MCMC methods that invoke simpler proposal distributions usually require many more samples in order to sufficiently sample the posterior, whereas methods that can utilize the problem structure to improve sample efficiency require more computational resources (Petra and others, Reference Petra, Martin, Stadler and Ghattas2014).
Methods that approximate the posterior distribution, rather than sample from it, provide appealing alternatives to MCMC. Both Isaac and others (Reference Isaac, Petra, Stadler and Ghattas2015) and Babaniyi and others (Reference Babaniyi, Nicholson, Villa and Petra2021) utilize a Gaussian approximation of the posterior centered on the maximum a posteriori (MAP) point (i.e. a Laplace approximation) in order to infer basal drag parameters for ice sheets. While Laplace approximations subvert the need for generating posterior samples (and the forward model evaluations associated with each sample), they can lead to posterior approximations that fail to capture much of the probability mass when the posterior is sufficiently non-Gaussian or multi-modal (Penny and others, Reference Penny, Kiebel and Friston2007). In contrast, variational methods that utilize the KL-divergence as an optimization criterion (as done here) tend to favor approximating distributions that match the moments of the target distribution (e.g. mean and variance), which tends to capture more probability mass. Sufficient capturing of probability mass can be especially important for posterior predictive modeling where non-linearities can lead to a large spread of predictions (e.g. see the section on ice shelf buttressing).
To that end, Brinkerhoff (Reference Brinkerhoff2022) introduced a variational inference method to jointly infer basal drag and ice rheology at a catchment-scale for glaciers. Importantly, the KL-divergence was used to estimate an optimal approximating distribution that also uses a Gaussian process prior, similar to the approximating distributions used in our work. A finite number of eigenvectors of the prior covariance are used to construct a linear model that permits inference at a lower dimension, similar to the sparse inducing points used in the VGP. The construction of the eigenvectors utilizes a coarse grid in the Fourier domain to model signals with a discrete set of spatial frequencies. Thus, the method of Brinkerhoff (Reference Brinkerhoff2022) shares many of the same features proposed here, with two main differences. Firstly, we take the ice surface velocity and thickness as given (subject to smoothing) and use the momentum balance based on the SSA as our forward model in order to compute r. On the other hand, all of the previous approaches strictly enforce r = 0 (i.e. they take satisfaction of the SSA momentum balance as a constraint) and use predicted velocities as the forward model. The difference between the two approaches may be interpreted as taking two different pathways to the same solution, where the difference in forward models will lead to different sensitivities to the ice rheology parameters (see Discussion section). Furthermore, our parameterization of the spatial fields with neural networks and VGPs permits independent evaluation of the spatial gradients in the SSA momentum balance using automatic differentiation. This independence allows for batch-based stochastic gradient descent, which is advantageous for large datasets.
3. Application to simulated ice shelves
We now apply the physics-informed variational framework to simulated ice shelves in order to evaluate the recovery of ice rigidity under varying degrees of model complexity and uncertainty and data noise. Furthermore, the simulated ice shelves allow us to isolate which mechanical factors control the inferred rigidity uncertainties, which will aid in building intuition for application of the framework to natural settings.
3.1. 1D ice shelf
We first simulate a laterally confined ice shelf using 1D SSA equations where lateral drag is parameterized assuming a rectangular bed with width w in the across-flow direction (Nick and others, Reference Nick, Van der Veen, Vieli and Benn2010). The momentum equation in the along-flow direction reduces to:
As there are no time derivatives in the momentum balance for ice flow, the above equation is solved for u for a given thickness profile h. The thickness is then evolved using mass conservation:
where a is the accumulation rate (set to 0 for this example) and q = uh is the horizontal ice flux through a vertical column of ice.
3.1.1. Simulation setup
We simulate an ice shelf with a width of 30 km and a length of 100 km, which is comparable to ice shelves of several ice streams in West Antarctica, such as Rutford Ice Stream. We prescribe a spatially varying B profile that is periodic in the along-flow direction while setting the flow law exponent to be uniform at n = 3. After simulating the shelf for 400 years to an approximate steady-state, we extract 200 random velocity and thickness values over the model domain to use as training data. We add spatially correlated noise by convolving a 1D field of independent Gaussian noise with a Gaussian kernel with a lengthscale of 5 km. We train a feedforward neural network with four hidden layers of 50 nodes each in order to reconstruct the velocity and thickness and a VGP with 15 inducing index points in order to predict the log rigidity θ. We use an a priori value of the prefactor B 0 = 400 yr1/3 kPa. For the prior distribution for θ, we describe a prior standard deviation of $\sigma _\theta = 1$ and a correlation lengthscale of $L_\theta = 15$ km. For the data likelihood parameterizing the residual SSA terms, we use an independent normal distribution with a mean of zero and a standard deviation of σ r = 2.0.
3.1.2. Evaluation of variational inference
As is commonly done in studies investigating variational inference techniques to approximate a target posterior distribution, we compare the estimated variational distribution with direct samples from the posterior using MCMC. Here, we utilize a No U-Turn Sampler scheme implemented in the NumPyro Python package (Phan and others, Reference Phan, Pradhan and Jankowiak2019), which uses automatic differentiation to efficiently generate sample trajectories for moderately high numbers of model parameters. We use the velocity and thickness predictions from the neural network as the observations such that MCMC only samples the rigidity posterior distribution, which allows for a more direct comparison between the variational and sampled posterior.
We find that both MCMC and variational inference recover a posterior mean profile for θ that is close to the true values for areas >20 km upstream from the ice front (Fig. 2). Close to the ice front, both methods predict uncertainties that are substantially larger due to thinner ice, which reduces the sensitivity of the longitudinal and lateral membrane stresses (left-hand terms in Eqn (17)) to rigidity variations. These uncertainties near the ice front can likely be reduced by augmenting the SSA residual loss with a dynamic boundary condition at the ice front that balances longitudinal stresses normal to the ice front with the difference in hydrostatic pressure between the ice and ocean water (Larour and others, Reference Larour, Rignot, Joughin and Aubry2005). Pair plots of marginal distributions of θ at different locations along the ice shelf show that the variational approach is able to recover strong covariances between θ samples for locations that are relatively close to each other while ensuring samples are uncorrelated for larger pair-wise distances. In general, the strength of the posterior covariance will be modulated by the physical model as well as the prior correlation lengthscale. Closer to the ice front, the marginal distributions derived from MCMC indicate a slight deviation from Gaussian behavior, which is again likely due to the lower ice thicknesses limiting ice stress sensitivity to rigidity variations. Since the variational distribution is constrained to be a multivariate normal, it is unable to recover the skewed, non-Gaussian behavior in the marginals in these areas. For applications where it is desirable to place more probability mass in the longer-tails of the posterior distribution, one could simply increase relevant variances ($\sigma _\theta$, σ r) to inflate the variational posterior variances or use a more flexible variational approximation not restricted to multivariate normals (Rezende and Mohamed, Reference Rezende and Mohamed2015).
3.2. 2D ice shelf
We now simulate a 2D ice shelf using the icepack ice flow modeling software (Shapero and others, Reference Shapero, Badgeley, Hoffman and Joughin2021). Similar to the synthetic ice shelf presented in Shapero and others (Reference Shapero, Badgeley, Hoffman and Joughin2021), we prescribe a semi-circular shelf geometry with four inlet glaciers of varying widths (Fig. 3). Additionally, we prescribe a bed topography that results in a few pinning points where the flotation height is positive, i.e. the ice is actually grounded at these locations. Under the shallow-stream approximation, we prescribe a basal drag friction coefficient proportional to the flotation height such that friction is only non-zero for grounded ice. Such pinning points in the form of ice rumples are common in ice shelves in Antarctica. However, assuming a fully floating ice shelf during inversion for rheological parameters will introduce errors into the inferred parameter field due to model mismatch. Therefore, by purposefully injecting modeling errors into the estimation procedure, we can assess how the two different cost functions and the estimated parameter uncertainties respond to such errors.
We use icepack to first simulate the evolution of shelf velocity and thickness for roughly 500 years with a constant ice rigidity, B 0, corresponding to an ice temperature of $-5\, ^\circ$C, a net surface mass balance of 0, and a stress exponent of n = 4. Here, we choose n = 4 in order to evaluate the sensitivity of the rigidity inference to 2D ice stress variations that are more likely to be found in natural environments of fast-flowing ice (Bons and others, Reference Bons2018; Millstein and others, Reference Millstein, Minchew and Pegler2022). After the first simulation stage, we apply a continuum damage mechanics model that modulates the rigidity field with an evolving damage factor, D, such that B D = (1 − D)B 0 (Borstad and others, Reference Borstad, Rignot, Mouginot and Schodlok2013; Albrecht and Levermann, Reference Albrecht and Levermann2014). This approach provides a physically realistic means to obtain a spatially varying prefactor field with rheology-modifying processes such as shear weakening. We run the damage-enhanced model for an additional 200 years to achieve approximate steady-state. At the end of the simulation, we can observe substantial spatial variation in damage, where ice is nearly undamaged at the grounding line (due to a zero-damage boundary condition) and highly damaged near the ice front, at shear margins and downstream of the pinning points. The dynamic effective viscosity field shows concentrated low viscosities near the pinning points and higher viscosities between the inlet ice streams where deformation rates are lower. Overall, the viscosities exhibit a mix of short- and long-wavelength features, which are mirrored in the effective strain rate field.
3.2.1. Variational inference setup
For recovery of the rigidity field, as we discussed during selection of the prior variance, we explore both the conventional control method-based inversion and the variational inference approach based on the momentum balance objective, as well as a combination of the two where we use the inversion to set B 0 for the prior. For all approaches, we use the simulated ice surface elevation to compute ice thickness by assuming hydrostatic equilibrium (buoyancy). Over floating ice, the thickness values derived from buoyancy are identical to the simulated thickness, but over the pinning points, the actual thickness values are lower, which results in an overestimation of the driving stress variations using the buoyancy conversion (Fig. S2). Furthermore, assuming flotation for the entire ice shelf will neglect the basal drag provided by the pinning points. The combined data and modeling errors will impact recovery of the prefactor field, which we explore shortly.
The control method inversion is again performed with icepack, using a Gauss–Newton solver to minimize a joint objective function that combines a velocity prediction error function and a regularization function based on the first-derivative of the log rigidity field, θ. The inversion includes Dirichlet boundary conditions for velocity values at the grounding line and Neumann (dynamic) boundary conditions at the ice-ocean interface. For the variational inference problem, we select 20 000 uniformly random locations on the ice shelf to extract velocity and thickness values to use as training data for the network f ψ (feedforward network of four layers of 100 nodes each), which is only tasked with reconstructing the surface observations. We select an additional, independent set of 20 000 random locations, or collocation points, for training the VGP (with 750 inducing index points), which is tasked with predicting the parameters of the variational distribution q(θ). Both the number of observations and collocation points will influence the effective spatial resolution of θ and can be interpreted as a mesh-free analog of the mesh element size. For all priors, we prescribe a lengthscale of 15 km, and for the prior with a uniform B 0, we use a value of B 0 = 260 yr1/4 kPa. After training, we evaluate training performance by reconstructing the surface observations over the entire model domain (using f ψ), as well as the predicted SSA residuals (using f ψ and mean rigidity as predicted by the VGP). For the variational inference predictions, the observation misfits and drag residual are minimal over most of the modeling domain but are higher over the two largest pinning points (Fig. S4). The higher errors are a function of oversmoothing of the observations and model mismatch, which amounts to assuming ice is floating over the grounded pinning points. As a consistency check, we use the posterior samples of θ to generate stochastic predictions of velocity using the standard forward model and find that velocity errors are generally <5% of the flow speed, with higher error values localized to the pinning points (Fig. S5). We note that the velocity errors are commensurate with those from the conventional inversion.
3.2.2. Evaluation of rigidity reconstruction errors
A more detailed comparison of the recovery error for B (recovered using $B = B_0\, {\rm e}^\theta$) between the control method inversion and variational inference reveals that the two methods are complementary. The control method inversion has the lowest overall error bias, but the areas where the errors are largest are systematically upstream of the pinning points (Fig. 4). Since we assume all ice is floating for the forward model, the missing resistive stress provided by drag at grounded ice is compensated by artificially making the ice stiffer upstream of the pinning points, which acts to slow the ice down in a manner that allows the predicted velocities to match the observed velocities. In contrast, the B recovered by variational inference (which uses the SSA momentum balance as a forward model) shows larger errors directly over the pinning points, as well as in areas where ice is stagnant (low strain rate). Over the pinning points, the true rheology sharply varies from about 250 to 200 prior yr1/4 kPa (Fig. 3d). However, the prior lengthscale of 15 km encourages spatially smoother fields of ice rigidity, which limits the dynamic range of ice stresses that can be modeled in order to satisfy the SSA momentum balance. Since the driving stress variations over the pinning points are overestimated due to the buoyancy assumption (Fig. S2), the preferred solution is to smooth out all stress variations over the grounded ice in order to minimize the residual SSA terms. Upstream of the pinning points and closer to the grounding line, the recovery errors are actually lower using variational inference as compared to the control method inversion. The spatial patterns in the recovery errors are similar to the patterns of residual SSA components (Fig. S4). Finally, by using the control method inversion as the reference B 0 for variational inference, we can minimize much of the recovery errors closer to the ice front and in areas where strain rates are lower but flow speeds are still high, i.e. areas where the inversion has greater sensitivity and where the dynamic boundary condition at the ice-ocean interface provides additional constraints on the rigidity (Fig. 4c).
3.2.3. Evaluation of rigidity uncertainties
The predicted uncertainties for θ are consistent with the reconstruction errors: uncertainties are higher closer to the ice front where ice thicknesses are lower (as observed in the 1D case), as well as in more stagnant ice where strain rates are lower (Figs 4d, 5). Uncertainties near the ice front are reduced when using B 0 from the control method inversion (Fig. S3), which is again likely due to the incorporation of the dynamic boundary condition at the ice-ocean interface which also reduced reconstruction errors. In areas where ice is thinner but strain rates are higher (e.g. higher shear strain rate in the areas between the fast-flowing ice), the balance between extensional stresses and lateral drag also provides sufficient signal for reducing uncertainties. In a few isolated patches, even when effective strain rates are low and ice is relatively thin, slightly positive lateral forces that act as a ‘pull’ on the ice can also reduce uncertainties (Fig. 5). Directly over the pinning points, uncertainties are low due to the high strain rates there. This mismatch between low uncertainties and high SSA residuals over the pinning points may be compensated by inflating the variance $\sigma _{\bf r}^2$ over areas where measured flotation height is near zero (see Discussion section). Downstream of the pinning points, we observe higher uncertainties due to downstream thinning of the ice. Overall, the uncertainty maps for θ indicate that areas with thicker ice and higher strain rates are better constrained, and targeted inverse modeling (e.g. estimating B 0 from a control method inversion) can be an effective tool for further reducing uncertainties (Fig. S3).
4. West Antarctica ice shelves
We now apply our methods to select large ice shelves in West Antarctica, specifically the Larsen C Ice Shelf (LCIS), Filcher-Ronne Ice Shelf (FRIS), Ross Ice Shelf (RIS) and the combined Brunt Ice Shelf with Stancombe-Wills Ice Tongue and Riiser-Larsen Ice Shelf (B-SW-RL) (Fig. 6). These ice shelves are fairly representative of shelf environments on the Antarctic coast and serve as a robust testing suite for several reasons. Firstly, they encompass a large area (48, 380, 440 and 68 × 103 km3 for LCIS, FRIS, RIS and B-SW-RL, respectively), corresponding to a large number of effective modeling parameters in order to test the inference capacity of the VGP. Secondly, the ice shelves are subject to different flow and buttressing environments. Large ice rises in Larsen C have favored the formation of large rifts, the evolution of which are complicated by the presence of mechanically weak suture zones that likely contain large proportions of mechanically weak marine ice (Jansen and others, Reference Jansen, Luckman, Kulessa, Holland and King2013; Kulessa and others, Reference Kulessa, Jansen, Luckman, King and Sammonds2014; Borstad and others, Reference Borstad, McGrath and Pope2017). Within Ross Ice Shelf (the largest ice shelf in Antarctica), a mix of ice rises, ice rumples and large islands serve to create a heterogeneous flow environment involving localized grounding, rift formation and shear margin weakening. Many of these pinning points lie in the western portion of the shelf off the Siple Coast, which drains much of the West Antarctic Ice Sheet through fast-flowing ice streams. Filchner-Ronne is also fed by several fast-flowing ice streams with large ice thicknesses, leading to larger driving stresses over the ice shelf with the highest overall flow speeds of the ice shelves examined here. The Brunt-Stancomb-Wills-Riiser-Larsen shelf complex (B-SW-RL) is subject to lower buttressing than Larsen C or Ronne-Filchner due to lack of embayments. However, within the Riiser-Larsen shelf are a few prominent pinning points that do provide limited buttressing but also serve as potential areas of model mismatch, similar to the synthetic ice shelf we previously investigated. Additionally, much of the ice in the Stancomb-Wills ice tongue is more loosely packed, leading to large surface gradients at the edges of individual ice units that are not well-matched to velocity variations.
4.1. Remote-sensing data and pre-processing strategy
For RIS and B-SW-RL, we use the MEaSUREs velocity mosaic (Rignot and others, Reference Rignot, Mouginot and Scheuchl2011; Mouginot and others, Reference Mouginot, Scheuchl and Rignot2012), which combines speckle tracking of SAR images from various satellite platforms with feature tracking of Landsat 8 images and has a nominal temporal coverage between 2009 and 2016. For LCIS and FRIS, we use a 2020 annual velocity mosaic provided by ITS_LIVE, which is derived from feature tracking of Landsat 7 and 8 images over Antarctica (Gardner and others, Reference Gardner, Fahnestock and Scambos2019). From a visual inspection, we found that the ITS_LIVE mosaic exhibited fewer velocity artifacts for LCIS and FRIS, whereas the MEaSUREs mosaic exhibited fewer artifacts over B-SW-RL and provided full coverage over RIS. Ice thickness and elevation data are derived from BedMachine V2 (Morlighem and others, Reference Morlighem2020), which combines radar-estimated thickness profiles with mass conservation constraints and firn corrections in order to obtain continuous thickness maps. While the nominal year for the thickness data is 2015, the correspondence between the velocity and thickness data are sufficient for the spatial resolution of our analysis (assuming an upper bound of ≈5 km of motion for feature advection). For all velocity and thickness rasters, we first perform a void-filling operation that uses a spring-based PDE constraint to fill in missing data (D'Errico, Reference D'Errico2012). The rasters are then filtered to ≈10−15 times the average ice thickness using a Savitzky–Golay filter in order to remove high-frequency components not resolvable by the SSA momentum balance.
4.2. Variational inference setup
As with the simulated ice shelf, we first invert for B using icepack in order to optimize a cost function combining a velocity misfit term (weighted by the formal uncertainties for the velocity estimates) and a regularization term based on first-order spatial gradients to encourage smoother solutions. All ice shelves are discretized into 2D triangular finite elements with an average size of 5 km. A penalty parameter controlling the relative contribution of the regularization term is selected with a standard L-curve analysis, independently for each ice shelf. For each ice shelf, we use feedforward neural networks with four layers of 100 nodes each and VGPs with 600–900 inducing index points. The training data for the feedforward neural networks and the collocation points for the VGP are independently constructed with a point density of 2 points per 10 square kilometers, which is commensurate with the size of the mesh elements used for the icepack inversion. This point density for the VGP is sufficient to model a moderate-resolution (~20 km) θ field, which when combined with the higher-resolution B 0 from the control method inversion results in rigidity samples that resolve important variations near shear margins and rifts. The neural network training observations are randomly sampled directly from the raster grids, whereas the collocation points are randomly generated using a latin hypercube sampling scheme. Finally, we use the estimated B field as the reference field B 0 for parameterization of θ, setting the prior variance for θ to 0.22.
4.3. Analysis of rigidity mean and uncertainty fields
To a first-order approximation, ice is inferred to be stiffer for FRIS and RIS than for LCIS and BWSRL, and average rigidity values for LCIS are the lowest of the four (Fig. 6). These first-order trends are well-matched by modeled ice shelf surface temperatures where temperatures for FRIS are generally around −25 to $-30\, ^\circ$C, whereas for LCIS they range from −15 to $-10\, ^\circ$C (Fig. S8). However, all ice shelves exhibit significant spatial variability in inferred ice rigidity beyond surface temperature variations in order to minimize SSA residuals (Figs S9, S10). For FRIS, the estimated mean B field is broadly consistent with results from prior studies (e.g. MacAyeal and others, Reference MacAyeal, Rignot and Hulbe1998; Larour and others, Reference Larour, Rignot, Joughin and Aubry2005). Ice is inferred to be substantially softer in the shear margins where strong lateral shearing leads to viscous dissipation and elevated ice temperatures. These shear margins are prominent in the Ronne Ice Shelf where fast-flowing floating ice is in contact with rock (along the Orville coast and Berkner Island) or stagnant ice, as is the case downstream of the Korff Ice Rise. As discussed in Larour and others (Reference Larour, Rignot, Joughin and Aubry2005), larger basal melt rates on the northern tip of the Henry Ice Rise are coincident with softer ice. Within the Filchner Ice Shelf, lower overall values of B indicate softer ice, again in the shear margins where ice streams flow onto the shelf and are in contact with stagnant ice. A large lateral surface crevasse close to the ice front is also associated with higher strain rates and softer ice. We can also observe localized regions of substantially stiffer ice, such as downstream of the Foundation Ice Stream and upstream of the Korff and Henry Ice Rises. These regions are associated with larger driving stresses (Fig. S7) such that ice is inferred to be stiffer in order to provide enough resistive stresses to balance those driving stresses. Ice is also inferred to be stiffer closer to the grounding line where colder ice is advected by the ice streams. For all ice shelves, rigidity uncertainties are mostly lower where ice is thicker and strain rates are larger, similar to what was observed for the simulated 1D and 2D ice shelves (Fig. 7).
Similar to FRIS, the ice in the central portions of RIS are inferred to be more rigid, likely due to relatively cold surface temperatures of −20 $^\circ$C. However, we can also observe zones of softer ice near shear margins and localized areas of grounding. At the inlet of the Byrd Glacier to the west, prominent shear margins separating the fast-flowing inlet ice from more stagnant shelf ice are coherent for more than 300 km downstream of the grounding line (Fig. S7), which results in substantial shear weakening. In the central trunk of the Byrd Glacier inlet, the reduction in flow speed as the ice flows onto RIS leads to enhanced compressional stress and thickening of the ice, leading to inferred higher B values. On the east side of RIS, the Shirase Coast Ice Rumples (SCIR) at the outlet of the MacAyeal and Bindschadler Ice Streams significantly modify the flow field and ice thickness due to grounding of the ice, consistent with the simulated pinning points for the synthetic ice shelf. Thinning of ice downstream of SCIR and diversion of the shear margins toward Roosevelt Island (RI) are both dynamical effects that modify the buttressing capability of ice in this region (Still and others, Reference Still, Campbell and Hulbe2019; Still and Hulbe, Reference Still and Hulbe2021). In our inferred B field, the ice covering the rumples is inferred to be softer while the downstream ice connected to RI is inferred to be stiffer. Alternatively, the ice upstream of Steershead Ice Rise (SIR) is near-stagnant, leading to very high inferred values for B. Downstream of SIR is a streakline of thin ice coincident with the shear margin of the inlet of MacAyeal and Bindschadler Ice Streams, leading to a narrow zone of soft ice that persists nearly all the way to the ice front.
At LCIS, the softest ice is inferred within highly localized areas corresponding to surface crevassing, including the large rift originating from the Gipps Ice Rise (Khazendar and others, Reference Khazendar, Rignot and Larour2011; Larour and others, Reference Larour, Rignot, Poinelli and Scheuchl2021). It is likely that some fraction of the inferred softness is due to not explicitly including rifts (geometrically and dynamically) within the ice flow model, which can reproduce a significant proportion of the observed strain rates with active opening/closing of rifts (Larour and others, Reference Larour, Rignot, Poinelli and Scheuchl2021). As is the case with FRIS, stiffer ice is inferred near the grounding line where colder and thicker ice is advected downstream by the inlet ice streams. Within the ice shelf, areas in between faster flowing ice correspond to thinner ice and higher strain rates, resulting in softer ice. Unlike FRIS, the proximity of the fast flowing inlet ice streams with one another limits the areal extent of stagnant ice over Larsen C. High effective strain rates between ice streams are aligned with the initiation of suture zones where mechanically weak marine ice (sourced from warmer ocean water) has been observed to accumulate at the base of LCIS (Kulessa and others, Reference Kulessa, Jansen, Luckman, King and Sammonds2014). The initial portion of the suture zones within approximately 20–30 km downstream of promontories and peninsulas are associated with inferred softer ice. Upstream of the Bawden Ice Rise (BIR), strain rates are substantially lower and correspond to larger inferred B values. Here, the correspondence between large fractures and a simulated confluence of meltwater plumes is hypothesized to stimulate abundant accretion of marine ice, which can actually lead to ice stiffening (Khazendar and others, Reference Khazendar, Rignot and Larour2011).
Finally, for B-SW-RL, ice is inferred to be substantially softer in the mélange area that separates the Brunt Ice Shelf from the Stancomb-Wills Ice Tongue, as well as in the mélange that separates the latter from the Riiser-Larsen Ice Shelf. These areas, which contain a heterogeneous mixture of marine ice, sea ice and ice shelf debris, have previously been inferred to exhibit lower rigidity values (within a continuum mechanics model) and act to bind large ice fragments to the coast (Khazendar and others, Reference Khazendar, Rignot and Larour2009). Since the mélange is less coherent than meteoric ice advected from the ice streams, it deforms readily and corresponds to high strain rates. Additionally, prominent surface crevasses throughout B-SW are also associated with softer ice, including several transverse rifts close to the grounding line of Brunt Ice Shelf and a frontal rift separating the northeastern corner of Brunt Ice Shelf from the Stancomb-Wills Ice Tongue. Since the nominal temporal coverage of the MEaSUREs velocity data is 2009–2016, the Halloween Crack has not yet initiated (De Rydt and others, Reference De Rydt, Gudmundsson, Nagler and Wuite2019). At the southern edge of Brunt Ice Shelf at the base of Chasm 1, ice is actually inferred to have high mean B, but since uncertainties are large here (Fig. 7), we consider this to be a smoothing artifact stemming from larger thickness errors near the large rifts. Upstream of the prominent pinning point on the Riiser-Larsen Ice Shelf (PP in Fig. 6), ice is inferred to be stiffer, similar to what we observed with the pinning points for the simulated ice shelf as a compensation for unmodeled basal drag. The thinner ice downstream of the pinning point is correspondingly inferred to be softer. We do note that the orientation of the flow field relative to the pinning point is more oblique than that of our simulated shelf, which likely is the source of the more complex strain rate pattern adjacent to the pinning point (Figs S7, S11). Finally, upstream of Lyddan Island in the mélange at the eastern edge of Stancomb-Wills, ice is inferred to have high rigidity, but as this area corresponds to both low strain rates and low driving stress, the uncertainty in rigidity is very large.
4.4. Posterior predictive distributions and ice shelf buttressing
After obtaining the variational distribution that best approximates the posterior distribution for the ice rigidity, we can compute a posterior predictive distribution for any quantity or forward model that depends on the rigidity. The most straightforward way to accomplish this is to generate random samples from the variational distribution and pass each sample through the forward model of interest, i.e. Monte Carlo approximation. For example, one could perform a dynamic perturbation analysis on specific ice shelves by applying some form of stress perturbation at the ice front (calving event, gain/loss of buttressing sea ice, etc.) and running prognostic simulations for different realizations of the rigidity, sampled from the posterior distribution. This type of analysis has been performed in many studies to assess sensitivity of ice shelves to changing climate conditions (e.g. Schlegel and others, Reference Schlegel2018; Nias and others, Reference Nias, Cornford, Edwards, Gourmelen and Payne2019), but usually the rigidity field is varied by choosing some uniform upper and lower bound guided by expected temperature variations or other a priori knowledge on creep mechanisms. By instead using the posterior distribution to draw samples of the rigidity, we automatically incorporate information derived from surface observations while also allowing known physical laws (e.g. SSA equations) to induce realistic covariances between values of the rigidity over finite length scales. In other words, the combined information from data and flow equations results in more realistic samples of physical parameters consistent with all available knowledge.
Since one of the most important physical implications of ice shelf rheology is the amount of buttressing applied to inland grounded ice, we use the variational distribution for B to compute the distribution of maximum buttressing factors following Fürst and others (Reference Fürst2016). The normal buttressing number is defined as (Gudmundsson, Reference Gudmundsson2013):
where τ is the deviatoric stress tensor, the quantity $N_0 = {1\over 2} \rho _i \left(1 - {\rho _i}/{\rho _w}\right)gh$ is the vertically integrated pressure exerted by the ocean (with density ρ w) on the ice shelf, and $\hat {{\bf n}}$ is a normal vector selected to be aligned with the second principal stress, following Fürst and others (Reference Fürst2016). By performing systematic calving simulations where ice is removed from an ice shelf up to different buttressing factor isolines, the increase in ice flux across the ice front or grounding line can be predicted for various buttressing factors (Fürst and others, Reference Fürst2016). The buttressing factor above which ice flux is projected to rapidly increase then serves as a buttressing threshold for a given ice shelf. The isoline corresponding to the threshold can then delineate regions of ‘passive’ shelf ice (PSI), defined as ice that can be removed without significantly altering the flow dynamics of the adjacent ice. As the normal force in the buttressing factor is computed from the ice stress tensor, which itself depends on the rigidity B to estimate the stress components, the buttressing factor will be subject to random variations consistent with the posterior samples of B. We can therefore estimate the expected variation in PSI consistent with the surface observations. To estimate a more realistic estimate of PSI area specific to calving, we only include buttressing factor isolines that form polygons that intersect the ice front, meaning we exclude areas of isolated PSI closer to the grounding line.
The buttressing thresholds originally presented by Fürst and others (Reference Fürst2016) corresponded to flux increases across the ice front, leading to threshold values of 0.3–0.4 for the ice shelves investigated here. Alternatively, thresholds defined for increased ice flux across the grounding line are found to be a better predictor for ice shelf stability in response to instantaneous calving events (Reese and others, Reference Reese, Gudmundsson, Levermann and Winkelmann2018; Mitcham and others, Reference Mitcham, Gudmundsson and Bamber2022). These buttressing values tend to range from 0.8 to 0.9. For the purposes of comparison with the result of Fürst and others (Reference Fürst2016), we use a lower threshold of 0.4 roughly corresponding to a step increase in flux across the ice front. Due to slight biases between our inferred mean B fields and the fields estimated by Fürst and others (Reference Fürst2016), our threshold value of 0.4 is slightly higher than that used by Fürst and others (Reference Fürst2016) in order to roughly match the PSI regions in that study.
We observe variations in PSI that lie roughly within the bounds computed from ±10% variation of the mean B, following Fürst and others (Reference Fürst2016) (Fig. 8). However, we can observe additional spatial and statistical patterns beyond the simple ±10% variations. For the ice shelves that are laterally confined by embayments, there are a significant number of samples of the PSI boundary that exceed the upper and lower bounds. Over Larsen C, the PSI boundary samples are slightly skewed toward lower PSI areas. However, several posterior samples of B actually connect passive ice centered on the rift originating from GIR to passive ice at the ice front, which increases total PSI area and slightly reduces the vulnerability of Larsen C to ice loss. Over FRIS and Ross, the PSI distribution is more symmetrical, although the former has a long tail of lower PSI areas, which correspond to a slight increase in vulnerability of those shelves to ice loss. Finally, over B-SW-R, the distribution of PSI is near-symmetric and lies well within the ±10% bounds. However, the difference in spatial extent between the ±10% bounds is larger than for the other ice shelves, particularly for the Stancomb-Wills ice tongue, which indicates a greater sensitivity to variations and uncertainties in inferred ice rigidity. This sensitivity is likely reflective of the lack of lateral confinement and drag and highlights the importance of embayment geometry on ice shelf buttressing force. Overall, these results demonstrate that calibration of ice shelf rigidity and associated uncertainties using surface data can both inflate/deflate predictive uncertainties and needs to be performed on a shelf-by-shelf basis.
5. Discussion
We demonstrated our proposed physics-informed variational inference framework by estimating the posterior distribution of ice rigidity for synthetic and large-scale ice shelves in Antarctica. The variational inference scheme produces posterior distributions of rigidity that agree well with those estimated by MCMC methods while providing a scalable approach for exploring uncertainties in parameter fields and forward predictions. We now briefly discuss potential avenues for further exploration of ice rheological parameters using distributions of B, as well as future algorithmic and computational improvements.
5.1. Uncertainties in ice rigidity propagated to flow law parameters
In this work, we focused on estimating the variational distribution for ice rigidity, B, and demonstrate how the the inferred uncertainties can be used to form predictive distributions on a derived buttressing factor. However, B was defined using the form of Glen's Flow Law in Eqn (1), which aggregates multiple physical factors into a single prefactor. The prefactor can be disaggregated using an Arrhenius-type relation with the following form (using the convention that B = A −1/n) (Cuffey and Paterson, Reference Cuffey and Paterson2010):
where A 0 is a reference prefactor value, R is the ideal gas constant, T is temperature, $T_0 \approx -10\, ^\circ$C is a transition temperature corresponding to a switch in the activation energy for creep, Q c, and E is an enhancement factor that depends on the ice crystallographic fabric, grain size, damage, and water and impurity content. Therefore, it is possible to decompose the inferred distribution of B into probability distributions for the unknown parameters in the above relation (all parameters except R, the ideal gas constant, a universal constant whose value is well-constrained) (Ranganathan and Minchew, Reference Ranganathan and Minchew2022). However, such a decomposition is highly ill-posed and only possible if relatively strong prior constraints are available for the parameters. For example, ice temperatures can be measured at select locations and modeled independently with an appropriate thermomechanical model. The spatial variations in E are likely to be highly correlated with the deformation mode (e.g. simple shear vs extension), which can be well-approximated from surface strain-rates. On the other hand, the activation energy Q c, which is temperature dependent through T 0, is likely to be relatively uniform within the two separate temperature regimes partitioned by T 0. The differences in expected spatial variation can thus be used as prior constraints when forming the joint posterior distribution of the parameters in Eqn (20).
5.2. Influence of modeling errors
Models of complex physical systems are generally incomplete and do not fully represent all physical processes found in natural settings. Modeling errors will therefore affect inference of parameter values and associated posterior distributions (Kennedy and O'Hagan, Reference Kennedy and O'Hagan2001). In the case of ice shelves, we have represented ice flow in a continuum mechanics framework with a momentum balance based on the SSA, which assumes that the vertical profile of ice rigidity for an ice column can be represented by its depth-averaged value and that all ice is floating within the ice shelf. The former assumption likely results in inconsequential prediction errors since ocean water provides minimal drag to the base of ice shelves. The assumption of floating ice is violated in areas where ice is locally grounded, which in the 2D synthetic shelf we observed can cause a localized bias in inferred rigidity values around and upstream of the grounded area. These biases arise from the uniform uncertainties, σ r, we prescribed in the likelihood model in Eqn (12). In reality, these uncertainties should be scaled according to expected variations in SSA residuals, which when corresponding to un-modeled basal drag can be informed by estimates of flotation height. A simple scaling of the uncertainties follows from consideration of the sensitivity of the forward model to the SSA residuals r x and r y, which are nominally zero over ice shelves. Since the forward model used here directly uses the SSA momentum balance, the sensitivity matrix for each SSA residual component is identity, and the total prediction uncertainty is proportional to the uncertainties in the nominal values for the residuals (Duputel and others, Reference Duputel, Agram, Simons, Minson and Beck2014). This approach is appropriate when the primary objective is physical interpretation of the distribution of rigidity values (as discussed in the previous section). However, if the primary goal is to use the posterior distribution of rigidity to construct ensembles of ice flow model runs (e.g. to estimate range of probable contributions to sea level rise), then a bias in the distribution for rigidity is acceptable since an increase in ice rigidity will compensate for the missing basal drag for grounded ice.
Another source of modeling uncertainty comes from our use of a conventional inverse method to pre-compute a B 0 field to be used as a prior mean. This strategy nominally reduces uncertainties in ice rigidity near the ice front (Fig. S3). However, the conventional inversion requires specification of a dynamic boundary condition at the ice front based on the hydrostatic pressure provided by the ocean water. In areas where considerable sea ice has formed at the ice front, uncompensated buttressing stress provided by the sea ice will lead to biased estimates of B 0, which can be considered as an additional source of modeling uncertainty. One strategy to account for such uncertainties is to treat B 0 as a hyperparameter in order to formulate a hyperprior for the rigidity. However, this approach would require a reformulation of the variational lower bound. A simpler approach would be to pre-construct a spatial map of uncertainties in B 0 due to uncertainties in the dynamic boundary condition by repeating the control method inversion for different values of buttressing stress at the ice front. The map of B 0 uncertainties can then be used to inflate the prior variance $\sigma _\theta ^2$ in areas where B 0 is more uncertain. A more direct quantitative approach would be to incorporate the dynamic boundary conditions in an additional loss function for the VGP-generated rigidity. The hydrostatic pressure provided by the ocean water would be treated as a random variable with some uncertainty, which would allow for the construction of a likelihood function that measured the misfit between the probabilistic longitudinal stress and the probabilistic hydrostatic pressure, both of which can be readily computed by the neural network and VGP using automatic differentiation.
5.3. Integration with numerical ice flow models
The SSA momentum balance is the basis of the forward model for our method, which differs from the forward model of predicted ice velocities used in traditional control-method-based inversions and previous studies investigating Bayesian methods for parameter estimation for ice dynamics (see section on related work). If the surface data are noise-free and the boundary conditions at the grounding line and ice front are known perfectly, the two different forward models would result in identical point estimates of ice rigidity. However, even in this ideal scenario, posterior inference with the two different forward models would lead to uncertainties with different spatial variations due to different model sensitivities. The velocity-based forward model is most sensitive to rigidity variations where velocities are higher, usually closer to the ice front. On the other hand, the momentum-based forward model is most sensitive to rigidity variations closer to the grounding line where driving stresses are higher, as well as in high strain-rate regions where driving stresses are lower (Fig. 5).
The different sensitivities will also lead to different systematic errors in the inferred rigidity. As shown with the synthetic 2D ice shelf, inference with the velocity-based forward model can lead to erroneously high rigidity values upstream of pinning points in order to compensate for omission of the basal drag of the pinning points in the ice shelf momentum balance. Alternatively, for the momentum-based forward model, when observed strain rates are relatively large in magnitude and have spatial length scales smaller than the prior length scale, the inferred rigidity will tend to be lower in order to minimize stresses in the momentum balance. This situation is more likely when the noise level in the velocity data is relatively high compared to the signal or if the velocity data contain high-frequency signals below the resolution of the momentum balance. In these cases, one possible strategy is to use the predicted velocities from the control method inversion as the ‘observed’ velocities for the variational inference framework. The predicted velocities from the former would have essentially been pre-filtered according to the momentum balance, leading to manageably low initial values of SSA residuals r for the momentum-based variational inference.
One possible avenue for future work is to integrate the variational inference scheme of Brinkerhoff (Reference Brinkerhoff2022), which uses a velocity-based forward model, with the methods presented here in order to provide complementary model sensitivities. Furthermore, recent advances in deep learning-based surrogate modeling could significantly improve the computational efficiency of velocity-based forward models by replacing expensive forward solves with much cheaper neural network predictions (Jouvet and others, Reference Jouvet2021).
5.4. Uncertainty quantification in physics-informed machine learning and computational considerations
In this work, we focused on estimating a variational distribution for the ice rigidity, conditional on observations of ice surface velocity and thickness and the SSA governing equations for ice flow. The methods presented here are also applicable to similar classes of physics-informed machine learning problems where parameter fields govern the evolution of observable spatiotemporal fields (e.g. Raissi and others, Reference Raissi, Yazdani and Karniadakis2020). For example, the variational inference framework could be used to infer a spatially varying thermal diffusivity for a model governed by the heat equation. Time-dependent observations of temperature profiles would be reconstructed by a neural network, T(x, y) = f ψ(x, y, t), and a VGP would be trained to generate samples of the thermal diffusivity at arbitrary spatial coordinates. Furthermore, if temporal gradients are computable, then the variational inference can be extended to time-dependent PDEs. Overall, as long as the governing equations of the physical system can be evaluated at arbitrary coordinates, the methods presented here are transferable. In some cases, the use of the feedforward neural network f ψ is not necessary when high-quality observations (and their necessary gradients) are available or can be pre-processed. Then, one would only have to estimate the parameters of the VGP for estimating the posterior distribution of the parameter field of interest.
From a computational efficiency standpoint, the VGP used for the variational distribution is a marked improvement from standard GPs, but the need to learn a full-rank Gaussian distribution at the inducing points still prevents the VGP from being applicable to very large spatial domains (or, equivalently, model domains where high-spatial resolution of parameter fields is desired). As the number of inducing points exceeds ≈1000, computational and memory requirements become excessive and training efficiency drops dramatically. While training efficiency of VGPs could be improved through the use of second-order optimizers (e.g. Newton- or quasi-Newton-based optimizers), joint training of VGP and neural network parameters would become intractable since neural networks tend to have a significantly larger number of parameters. Therefore, future work must involve the development of alternative models for variational distributions that are suitable for large effective model dimensions (e.g. Brinkerhoff, Reference Brinkerhoff2022).
6. Conclusions
In this work, we present a framework for inferring the posterior distribution of ice rheology for large ice shelves in West Antarctica. Motivated by recent advances in physics-informed machine learning and variational inference, the framework utilizes neural networks to reconstruct spatially dense observations of ice surface velocity and thickness, which allows for mesh-free evaluation of surface variable values and associated spatial gradients. At the same time, we task a variational Gaussian Process to predict the mean and covariance of ice rheological parameters for arbitrary spatial coordinates. By using the momentum balance for ice-flow appropriate for ice shelves, we formulate a mapping from parameters (rheology) to observables (residual momentum) that is inherently parallelizable and allows for joint training of the neural networks and variational Gaussian Process using stochastic gradient descent. The training objective utilizes a variational approximation to Bayesian inference, which provides an explicit way to encode prior rheology information in the form of spatial lengthscales (to modulate smoothing of the inferred rheology field) and range of variation relative to a reference field. For the latter, we show that using a conventional inversion method to estimate a prior mean field can reduce reconstruction errors, which demonstrates a potentially favorable approach to exploring uncertainties in large-scale ice flow models without injecting them into computationally expensive MCMC samplers.
Using these methods, we demonstrate posterior inference of ice rheology for synthetic 1D and 2D ice shelves. We find that rheological uncertainties are lowest where driving stresses and strain rates are higher, corresponding to larger components of the momentum balance and higher levels of ice deformation, which implies more information about ice rheology. Using the synthetic 2D ice shelves, we also demonstrate how the momentum balance-based forward model can help reduce biases in inferred ice rigidity near areas of localized grounding where the shallow-shelf approximation of the momentum balance is violated. Inference of the distribution of ice rigidity values for select West Antarctic ice shelves reveal a wide range of spatial patterns consistent with highly heterogeneous flow environments. Generally, we find softer inferred ice in shear margins, near pinning points, and around visible surface crevasses. Conversely, ice is inferred to be stiffer where bulk ice temperatures are lower and where compressional stresses result in thickening of ice, such as upstream of certain ice rises or where fast-flowing ice streams flow onto slower shelf ice. Finally, using the posterior covariances of rigidity, we generate stochastic predictions of buttressing factors for the West Antarctic ice shelves and show how different flow environments can result in different ranges of passive shelf ice areas, as well as different levels of non-Gaussian behavior. These results demonstrate the utility of site-specific posterior inference for predictive modeling as opposed to assuming uniform lower and upper bounds for an entire ice sheet.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/jog.2023.8.
Data
ITS_LIVE velocity data can be downloaded at https://nsidc.org/apps/itslive/. MEaSUREs velocity data can be downloaded at https://nsidc.org/data/NSIDC-0484/versions/2. BedMachine V2 thickness data can be downloaded at https://nsidc.org/data/NSIDC-0756. Example scripts for variational inference for 1D and 2D ice shelves can be found at https://github.com/bryanvriel/physics-informed-vi. Neural network and variational Gaussian Process models were written using TensorFlow (https://www.tensorflow.org) and TensorFlow Probability (https://www.tensorflow.org/probability).
Acknowledgements
The authors thank two anonymous reviewers and the Scientific Editor Doug Brinkerhoff for their constructive feedback for improving the quality of this work. This research was partially supported by NSFGEO-NERC grant 1853918, the John W. Jarve (1978) Seed Fund for Science Innovation, and the Earl A Killian III (1978) and Waidy Lee Fund.
Appendix A. Probabilistic formulation of physics-informed ice rigidity inference
We aim to formulate an objective function for estimating a probability distribution for ice rigidity fields θ(x) on ice shelves, conditioned on noisy observations of surface velocity and ice thickness (or thickness derived from surface elevations using a hydrostatic assumption). Additionally, we aim to construct mean fields (i.e. smoothed fields) for ice surface velocity and thickness that are consistent with the observations and their uncertainties while also minimizing the SSA momentum balance residuals for random samples of θ. To that end, let us first define an augmented parameter vector for a given ice shelf, ${\bf m} = [ \theta ,\; \, \hat {{\bf d}}]$, where $\hat {{\bf d}} = [ \hat {u},\; \, \hat {v},\; \, \hat {h}]$ is the vector of reconstructed surface observations. As stated in the main text, θ = θ(x) and $\hat {{\bf d}} = \hat {{\bf d}}( {\bf x})$ are implied, i.e. all scalar and vector quantities in this discussion are spatial fields spanning the ice shelf. Furthermore, we make no assumption on the parameterization of θ and $\hat {{\bf d}}$ as this section is focused on formulating the probabilistic relationships between modeled and observed fields. For our observations, we have d = [u, v, h], as well as a vector of pseudo-observations corresponding to the momentum balance residuals in Eqns (2 and 3), r = [r x, r y]. Recall that in our physics-informed machine learning framework, r = [0, 0] as we seek to learn θ that minimizes the SSA momentum balance residuals for a given set of surface predictions $\hat {{\bf d}}$.
We use Bayes’ Theorem to obtain an expression for the posterior distribution of m given our combined observations:
The first two terms on the right-hand side correspond to the joint data likelihood, which encodes the probability of having ‘observed’ d and r for a given m within the SSA equations. The third term corresponds to the prior distribution which encodes our prior knowledge on m values without having seen any observations. We can expand and simplify the above posterior distribution by utilizing what we know about the dependencies between the augmented parameters and observations in the physics-informed machine learning framework. Firstly, the surface observations d are only being reconstructed by a regression model (e.g. the neural network f ψ) that does not explicitly depend on θ; thus, d is only dependent on $\hat {{\bf d}}$. Similarly, the prior can be expressed as $p( {\bf m}) = p( \theta ) p( \hat {{\bf d}})$ since the predictions $\hat {{\bf d}}$ are again generated by an independent regression model and are independent of the prior values for θ. Furthermore, we can prescribe a uniform prior for $\hat {{\bf d}}$ as no explicit regularization is imposed on its values. We can therefore write the posterior distribution for θ and $\hat {{\bf d}}$ as:
which matches Eqn (5) in the main text.
A.1. Variational lower bound to posterior distribution
As discussed in the main text, we use a variational inference framework wherein we aim to construct an approximating, variational distribution q(θ, $\hat {{\bf d}})$ that is minimally divergent from the true posterior $p( \theta ,\; \, \hat {{\bf d}} \vert {\bf d},\; \, {\bf r})$. Generally, the variational distribution is a parametric distribution that is easy to sample from and has a likelihood that can be easily evaluated. Therefore, choosing an appropriate variational distribution will depend on the desired level of computational complexity (e.g. number of trainable parameters) and the expressiveness of the distribution for capturing the relevant features and statistics of the target posterior. Expected features such as skewness, multiple modes, correlations between parameters, etc., will impact the choice of variational distribution. Posterior distributions with highly complex probability topologies tend to be ill-suited for variational inference and will likely require MCMC-based methods. Even for less complex posteriors, variational distributions that are overly simple, such as an independent normal for each parameter, will fail to capture parameter correlations and can lead to significantly underestimated posterior variances (Blei and others, Reference Blei, Kucukelbir and McAuliffe2017). In this work, as discussed in Appendix B, we choose to use multivariate normal distributions in which the elements of the mean vector and covariance matrix are the learnable parameters.
A commonly used metric for quantifying the similarity between variational and target distributions is the Kullback–Leibler (KL) divergence, which for our problem is defined as:
It is important to note that the above equation is referred to as the reverse-KL-divergence, denoted as ${\cal KL}[ q \Vert p ]$ for brevity, as opposed to the forward-KL-divergence, ${\cal KL}[ p \Vert q ]$. As documented in previous studies (e.g. Huszár, Reference Huszár2015; Albergo and others, Reference Albergo2021), training the variational distribution with a reverse-KL divergence loss encourages mode-seeking behavior (but does not enforce it like the Laplace approximation), i.e. the distribution will be centered closer to regions with the highest posterior probabilities. A forward-KL divergence tends to encourage mean-seeking behavior. For inference of rigidity with posterior distributions that are generally unimodal and well-approximated by Gaussian distributions, the two KL-divergences should lead to similar variational distributions. However, the reverse-KL formulation leads to a more computationally efficient variational lower bound, which we now turn to.
The posterior $p( \theta ,\; \, \hat {{\bf d}} \vert {\bf d},\; \, {\bf r})$ in the denominator of the second term in Eqn (A3) causes the integration over all θ to be intractable. In this case, it can be shown (e.g. Titsias, Reference Titsias2009; Hensman and others, Reference Hensman, Fusi and Lawrence2013; Matthews and others, Reference Matthews, Hensman, Turner and Ghahramani2016; Blei and others, Reference Blei, Kucukelbir and McAuliffe2017) that minimization of the KL-divergence can be replaced by maximization of a stochastic variational lower bound, often referred to as the Evidence Lower Bound (ELBO):
The first term of the ELBO is the expected value of the joint data log-likelihood, integrated over θ and $\hat {{\bf d}}$. The second term is the KL-divergence between the variational distribution and the prior distribution. Thus, the ELBO can be interpreted as an optimization objective that encourages the variational distribution $q( \theta ,\; \, \hat {{\bf d}})$ to produce samples of θ and $\hat {{\bf d}}$ that best fit the surface observations and minimize the SSA momentum balance residuals while also maintaining consistency with the prior distribution. We can further decompose the variational distribution to be $q( \theta ,\; \, \hat {{\bf d}}) = q( \theta ) q( \hat {{\bf d}})$, i.e. the rigidity field and surface predictions are modeled independent of each other. The independence allows us to use different machine learning models for predicting the surface variables and rigidity field. Here, we use a neural network f ψ for the former and a variational Gaussian Process for the latter. Moreover, as the network f ψ only predicts the mean $\hat {{\bf d}}$ values, $q( \hat {{\bf d}})$ is equivalent to a δ-distribution, e.g. $\int x \delta ( x - \hat {x}) \, {\rm d}x = \hat {x}$. Additionally, we prescribe a uniform prior for $\hat {{\bf d}}$ such that $p( \hat {d}_i) = {1\over b_i - a_i}$ where $a_i \le \hat {d}_i \le b_i$ for the i-th component of $\hat {{\bf d}}$. These formulations allows the KL-divergence in Eqn (A.4) to be simplified in the following manner:
where ${\cal C}$ is a constant resulting from the uniform prior $p( \hat {{\bf d}})$. Altogether, the ELBO can be written as:
which is Eqn (7) in the main text. Note that we exclude the constant ${\cal C}$ from the above equation, but generally, one can include a multiplicative factor on the remaining KL-divergence to tune the regularization effect of the prior on the posterior inference.
The expectation operator on the log-likelihood of $p( {\bf r} \vert \theta ,\; \, \hat {{\bf d}})$ can be numerically estimated using either a Monte Carlo integration or a quadrature-based integration. For high-dimensional integrals, Monte Carlo methods are favorable over quadrature methods, which require an exponential amount of quadrature points in the number of dimensions. However, the SSA residual vector r can be evaluated at each point on the ice shelf independently, provided that gradients of θ and $\hat {{\bf d}}$ are available in order to compute relevant stress gradients. Both the neural network and variational Gaussian Process models permit the computation of spatial gradients on a coordinate-by-coordinate basis by utilizing automatic differentiation. Therefore, the expectation of $\log p( {\bf r} \vert \theta ,\; \, \hat {{\bf d}})$ can be decomposed into a series of one-dimensional integrals. We follow the approach of Dillon and others (Reference Dillon2017) and evaluate each 1D integral with a Gauss–Hermite quadrature with 3 quadrature points, which is exact for 1D Gaussian log-likelihoods:
where n is the number of quadrature points, w i are the quadrature weights for the quadrature grid points g i, $\hat {\sigma }_\theta$ is the predicted posterior standard deviation for θ, and $\hat {\mu }_\theta$ is the predicted posterior mean for θ. Here, we compute w i and g i using the NumPy function np.polynomial.hermite.hermgauss (Harris and others, Reference Harris2020).
Appendix B. Variational Gaussian processes
In this work, we use a variational Gaussian Process (VGP) to model the variational distribution q(θ) introduced in Appendix A. Gaussian Processes (GPs) are a class of machine learning models that describe a set of random variables, $\{ f( {\bf x}) \vert {\bf x} \in {\cal X}\}$, where x is a finite set of index points (i.e. spatial or temporal coordinates) from ${\cal X}$. GPs assume that every finite collection of random variables follows a multivariate normal distribution (Rasmussen, Reference Rasmussen2003). Thus, to describe a GP, one needs to specify a mean function, μ(x), and a covariance, or kernel, function k(x, x′) that describes the pairwise similarity between index points x and x′. In this work, we use a squared exponential kernel function for both posterior inference (Eqn 9) and for constructing the prior distribution (Eqn 13). The difference in usage between the two is that for posterior inference, σ and L are tunable hyperparameters, whereas for the prior, $\sigma _\theta$ and $L_\theta$ are fixed values.
Given a vector of training data, ${\bf y} = f( {\bf x}_t) + {\boldsymbol \epsilon }$, where xt represent the training index points and ${\boldsymbol \epsilon }$ represents independent Gaussian observation noise, the mean and covariance functions can be inferred via Bayesian inference in order to to estimate a GP posterior. For predictive points x, i.e. index points not used for training, the closed-form expressions for the GP posterior mean, μ p(x), and covariance, k p(x, x′), functions are:
where we adopt the shorthand notation K xx′ = k(x, x′). The above formulations therefore allow for generation of samples of f(x) at any arbitrary index points x by computing a mean and covariance matrix for a multivariate normal distribution.
For a training set of size N, the time complexity of GP inference scales as ${\cal O}( N^3)$ with a storage complexity of ${\cal O}( N^2)$, primarily due to the need for computing and inverting a covariance matrix of size N × N in Eqn ( B7). Therefore, when N exceeds several thousand data points, the computational requirements become excessive, even for modern computing platforms. VGPs address this limitation by building a low-dimensional representation of the dataset by introducing a set of inducing variables, which are indexed by a set of M coordinates, z, which live in the same space as ${\cal X}$ (Quinonero-Candela and Rasmussen, Reference Quinonero-Candela and Rasmussen2005; Titsias, Reference Titsias2009). Importantly, M ≪ N, such that the inducing variables serve as a much smaller set of latent observations that are used in the above GP inference equations rather than the full training data. In the formulation presented by Hensman and others (Reference Hensman, Fusi and Lawrence2013) which we use here, the locations of z are unknown and must be estimated during training of the VGP. Two key approximations are used for the inducing variables: (1) function values at non-inducing points, x, are mutually independent given function values at the inducing index points, z; and (2) the posterior distribution p(f(z)|y), which can be expensive to compute for large N, can be approximated with a multivariate normal distribution with mean ${\bf m} \in {\opf R}^{M \times 1}$ and covariance ${\bf S} \in {\opf R}^{M \times M}$. The VGP posterior predictive mean, μ q(x), and covariance, k q(x, x′), functions then take the form:
where ${\bf B} = K_{{\bf zz}}^{-1} {\bf S} K_{{\bf zz}}^{-1}$. The mean and covariance are then used to form the variational distribution for the ice rigidity, $q( \theta ( {\bf x}) ) = {\cal N}( \mu _q( {\bf x}) ,\; \, k_q( {\bf x},\; \, {\bf x}) )$. The above equations now have a time complexity of ${\cal O}( NM^2)$ and a storage complexity of ${\cal O}( NM)$. Note that one still cannot evaluate the variational kernel function for exceedingly large x, which may be the case for large ice shelves in Antarctica. For this reason, we predict covariance matrices for finite blocks of the modeling domain in order to seed a Gibbs sampler, as explained in the Methodology section of the main text Eqns (16a–16c).
The total set of trainable variables for the VGP, ϕ, consists of the inducing point locations, z, the inducing point mean values, m, the full-rank inducing point covariance matrix, S, and the hyperparameters of the base kernel function (σ and L of the squared exponential kernel). As we only need to estimate the lower-triangular portion of S, the number of covariance parameters becomes ${M( M + 1) \over 2}$. Furthermore, since the diagonal values of S must strictly be positive, we apply a sotfplus transformation followed by small shift of 1e−5 in order to avoid numerical issues (see the FillScaleTriL bijector implemented in TensorFlow Probability). The number of inducing points, M, is a prescribed parameter that depends on the complexity of the signal being inferred. For example, for the ice shelves studied in this work, we find that M between 300 and 1000 provided a reasonable trade-off between computational complexity and resolution of the inferred θ(x). The optimal value of M will also depend on the choice of the length scale $L_\theta$ for the prior distribution. Smaller $L_\theta$ will lead to θ(x) with more spatial variability, which will require a larger M.
Optimization of the variational parameters is achieved by inserting the variational distribution $q( \theta ( {\bf x}) ) = {\cal N}( \mu _q( {\bf x}) ,\; \, k_q( {\bf x},\; \, {\bf x}) )$ into the ELBO in Eqn (10). Note that the last term of the ELBO is the KL-divergence of q(θ) from the prior evaluated at the inducing points, z, which is consistent with the VGPs formulated by Titsias (Reference Titsias2009) and Hensman and others (Reference Hensman, Fusi and Lawrence2013). With this formulation, q can be constructed using the variational mean and covariance directly, $q( \theta ( {\bf z}) ) = {\cal N}( {\bf m},\; \, {\bf S})$ (note how Eqn (B8) simplifies to m and S when K xz = K zz). Empirically, we found that ${\cal KL}[ q_{{\boldsymbol \varphi }}( \theta ) \Vert p( \theta ) ]$ can also be well-approximated by using Eqn (B8) to construct q(θ(xc)) and p(θ(xc)) at a mini-batch of collocation coordinates xc, assuming the mini-batch of coordinates are roughly uniformly distributed throughout the modeling domain. This approach can improve training time when the number of inducing points is significantly larger than the batch size.