1. Introduction
Bayesian inference and parameter estimation are the cornerstones of gravitational-wave astronomy. The Bayesian framework is used to derive posterior distributions for parameters such as the masses and spins of merging pairs of neutron stars and black holes. Examples range from the progenitor of the first detected gravitational wave, GW150914 (Abbott et al. Reference Abbott2016b) and the binary neutron star merger, GW170817 (Abbott et al. Reference Abbott2017), to the many sources now presented in the catalogues of gravitational-wave transients published by the LIGO–Virgo–KAGRA Collaboration (Abbott et al. Reference Abbott2019a; Abbott et al. Reference Abbott2021d,Reference Abbotta,Reference Abbottb) and independent authors (e.g., Venumadhav et al. Reference Venumadhav, Zackay, Roulet, Dai and Zaldarriaga2019, Reference Venumadhav, Zackay, Roulet, Dai and Zaldarriaga2020; Olsen et al. Reference Olsen, Venumadhav, Mushkin, Roulet, Zackay and Zaldarriaga2022). Using Bayesian inference, we obtain values for the marginal likelihood (also known as the evidence), which are used for model selection. Model selection has been used, for example, to compare models of the neutron star equation of state (Abbott et al. Reference Abbott2020b), assess the most likely progenitors of intermediate-mass black hole merger GW190521 (Abbott et al. Reference Abbott2020c; Romero-Shaw et al. Reference Romero-Shaw, Lasky, Thrane and Calderón Bustillo2020; Bustillo et al. Reference Bustillo2021), and study gravitational-wave events for evidence of modified gravity (e.g., Ghosh 2022).
A second layer of Bayesian analysis is built upon this foundation to study the population properties of merging binaries. Hierarchical models are used to estimate population hyper-parameters describing how sources of gravitational waves are distributed according to mass, spin, redshift, and so on. This has been key in, for example, the discovery of features in the distribution of binary black hole masses (Abbott et al. Reference Abbott2020a, Reference Abbott2021c).
Results obtained with Bayesian inference are only as reliable as the underlying model. The compact-object binary parameters reported in gravitational-wave transient catalogues are derived using models that describe physical systems; only if these models are sufficient descriptors of the true system can these results be meaningful. Bayesian inference can tell us that one model is a better explanation for the data than another. For example, Bayesian techniques have been used to suggest that intermediate-mass black hole merger GW190521 shows signs of non-zero orbital eccentricity (Romero-Shaw et al. Reference Romero-Shaw, Lasky, Thrane and Calderón Bustillo2020). However, Bayesian inference on its own does not tell us if either the quasi-circular or eccentric gravitational waveforms considered provide an adequate fit to the GW190521 data. Similarly, Bayesian inference has been used to suggest that the distribution of primary black hole mass is better fit by a broken power-law distribution than a power-law distribution with no break (Abbott et al. Reference Abbott2020a). However, Bayesian inference on its own does not tell us if either of these models is adequate to describe the observed distribution of primary black hole masses.
As the gravitational-wave catalogue grows and gravitational-wave detector sensitivity improves, we begin to see more events that push the boundaries of our understanding of the Universe. This makes it ever more important to test the validity of our models. A signal model that is valid for systems with mass ratios $q \geq 0.125$ may be invalid for a mass ratio of $q = 0.001$ . A detector noise model adequate for an event with signal-to-noise ratio $\text{SNR}=30$ may inadequately represent the detail contained within an $\text{SNR}=100$ signal. Additionally, as the number of gravitational-wave signal detections grows, the resolving power of the combined dataset increases. This makes it ever more important to test the validity of population models. A population model for the distribution of binary black hole redshifts that works reasonably well for a dozen events may be unsuitable for a catalogue with hundreds of events.
In this Article, we describe different ways in which models can fail and lay out commonly used tests that can be carried out to reveal these failures, often beginning with visualisation. For a broader discussion of how visualisation can assist in solving Bayesian inference problems, see Gabry et al. (Reference Gabry, Simpson, Vehtari, Betancourt and Gelman2017). We discuss workarounds for misspecified models, including model redesign and data coarsening (Miller & Dunson Reference Miller and Dunson2019; Thomas & Corander Reference Thomas and Corander2019). For an idea of how Bayesian inference problems may be solved through a careful workflow and iterative model redesign of the prior and likelihood, see both Betancourt (Reference Betancourt2020) and Gelman et al. (Reference Gelman2020). While we cast many examples in the language of gravitational-wave astronomy, we endeavour to use sufficiently general language so that this review is useful to a broad audience. For additional resources, we refer the reader to Chapters 6-7 of Gelman et al. (Reference Gelman, Carlin, Stern, Dunson, Vehtari and Rubin2013), respectively devoted to ‘Model checking’ and ‘Evaluating, comparing, and expanding models.’ See also ‘Model Checking and Sensitivity Analysis’ in Andreon & Weaver (Reference Andreon and Weaver2015).
Almost all of the Subsections in this review follow the same formula. After introducing a concept, we provide a bullet list of recommended tests. This list is followed by a demonstration, which illustrates the tests with simple examples. This layout is designed to help researchers scan the Article to quickly find the misspecification tests they are looking for. Our recommended tests do not constitute an exhaustive list of the ways in which one may test for misspecification. In addition, an analysis that passes all of these tests may still suffer from misspecification—there is no silver bullet to detect all forms of misspecification! Furthermore, models of complex physical processes will always be somewhat deficient; as statistician George Box famously stated, ‘all models are wrong, but some are useful’ (Box Reference Box1976). Nonetheless, the tests recommended here provide a starting point for checking the suitability of models for the task at hand: that is, to provide a ‘good enough’ description of the data that meaningful inferences can be made. All of the demonstration code is available in jupyter notebook form here: tinyurl.com/bf4n9vw5. There is a dedicated notebook for each Section.
Broadly speaking, two different kinds of models are required to do an inference calculation. Every such calculation requires a model for the distribution of the data—the likelihood function—and a model for the distribution of the parameters—the prior. The remainder of this Article is organised as follows. In Section 2, we discuss the importance of data visualisation. In Section 3, we describe misspecification of the data: misspecified likelihood functions. In Section 4, we describe misspecification of population models: misspecified priors. In Section 5, we describe how apparent outliers may or may not be signs of model misspecification. We provide closing thoughts in Section 6.
2. Preface: The importance of visualisation
Below we describe many goodness-of-fit tests that can be used to determine the suitability of different models. However, even the most carefully crafted tests are no replacement for sanity checks with visualisation. Plotting data, we can sometimes see obvious model failures that we might not have thought to check a priori. The importance of visualisation is dramatically illustrated by Anscombe’s Quartet (Anscombe Reference Anscombe1973): four 11-point datasets with noticeably different trends that nonetheless have near-identical simple descriptive statistics.Footnote a
The four datasets that comprise Anscombe’s Quartet are plotted in Figure 1. By studying these graphs, one can begin to diagnose anomalies: for example, two of the data sets each contain a single outlier that skews the correlation coefficient (lower left) or implies the existence of a relationship that is not supported by the rest of the data (lower right). We can also see that the dataset in the top right would be better-specified by a non-linear relationship between x and y. The Quartet is a cautionary tale to those who wish to establish the ‘goodness’ of their model: if one’s model does not well-specify one’s data, then the calculated ‘goodness’ metric is not to be trusted. The starting point for any exploration of misspecification, therefore, should be to visually compare the model and the data.
3. Likelihood misspecification
3.1. Basics
Models for the data are built on assumptions about the nature of both the noise and signal being measured. The data model is described by the likelihood function
where d is the data and $\theta$ is a set of parameters describing the noise and/or signal.Footnote b The likelihood function is a normalised probability density function for the data, not for the parameters $\theta$ (see Thrane & Talbot Reference Thrane and Talbot2019, for more details):
It is useful to define a marginal likelihood, which is also known as the Bayesian evidence:
Here, $\pi(\theta)$ is the prior distribution for the parameters $\theta$ . The marginal likelihood is a model for the data, averaged over realisations of $\theta$ .
A well-known example of a model for gravitational-wave data is the Whittle likelihood model for Gaussian time-series noise, shown here for a single frequency bin:
Here, $\tilde{d}$ represents the frequency-domain gravitational-wave strain while $\sigma^2$ is related to the noise power spectral density (PSD) P and the frequency bin width $\Delta f$ :
Equation (5) is an example of a parameter-free model of the data. If, for example, a gravitational-wave signal from a compact binary coalescence is present, then the likelihood depends on the ${\gtrsim}15$ parameters associated with a compact binary coalescence (component masses, spins, etc.),
Here, $\tilde{h}_k(\theta)$ is a model for the frequency-domain strain from a gravitational-wave signal given binary parameters $\theta$ in frequency bin k. Sometimes, $h_k(\theta)$ , which can be defined in either the time domain or the frequency domain, is referred to as ‘the waveform model.’
Examining Equation (7), we can see various ways in which the likelihood can be misspecified, which we represent diagrammatically in Figure 2. First, the waveform $h(\theta)$ may be misspecified, which can lead to well-documented systematic error (Ohme et al. Reference Ohme, Nielsen, Keppel and Lundgren2013; Wade et al. Reference Wade, Creighton, Ochsner, Lackey, Farr, Littenberg and Raymond2014; Ashton & Khan Reference Ashton and Khan2020; Gamba et al. Reference Gamba, Breschi, Bernuzzi, Agathos and Nagar2021; Huang et al. Reference Huang, Haster, Vitale, Varma, Foucart and Biscoveanu2021). This is an example of a misspecified signal model. Second, the noise model can be misspecified. There are typically two ways that this can happen. One possibility is that the functional form of the likelihood is correct, but the noise PSD is misspecified. A number of papers have proposed different methods for estimating the noise PSD in order to minimise this form of misspecification, for example, Littenberg & Cornish (Reference Littenberg and Cornish2015), Cornish & Littenberg (Reference Cornish and Littenberg2015), Chatziioannou et al. (Reference Chatziioannou, Haster, Littenberg, Farr, Ghonge, Millhouse, Clark and Cornish2019).
The other possibility is that the functional form of the likelihood is itself misspecified. This may occur because of non-Gaussian noise artefacts (Röver, Meyer, & Christensen Reference Röver, Meyer and Christensen2010) or uncertainty in the noise PSD (Talbot & Thrane Reference Talbot and Thrane2020; Biscoveanu et al. Reference Biscoveanu, Haster, Vitale and Davies2020; Banagiri et al. Reference Banagiri, Coughlin, Clark, Lasky, Bizouard, Talbot, Thrane and Mandic2020), both of which yield broader tails than the Whittle distribution. Likewise, marginalising over calibration uncertainty broadens the likelihood function (Sun et al. Reference Sun2020; Payne et al. Reference Payne, Talbot, Lasky, Thrane and Kissel2020; Vitale et al. Reference Vitale, Haster, Sun, Farr, Goetz, Kissel and Cahillane2021).Footnote c Covariance between neighbouring frequency bins induced from finite measurements of continuous noise processes can also lead to misspecification if not correctly accounted for (Talbot et al. Reference Talbot, Thrane, Biscoveanu and Smith2021; Isi & Farr Reference Isi and Farr2021). In the subsequent Subsections, we describe tests for these different forms of misspecification.
3.2. Testing for a misspecified signal model
In order to test for a misspecified waveform, it is sometimes useful to look at the whitenedFootnote d residuals of the data in frequency
and time
where ${\mathcal{F}}^{-1}$ is the discrete inverse Fourier transform. Residuals can be useful to test for waveform misspecification because the differences between waveform models are clearly seen in the time and frequency domain.Footnote e Additionally, if there is a terrestrial noise artefact (glitch) present in the data, it is likely to appear clearly in the residuals. For example, in Abbott et al. (Reference Abbott2016b), the best-fit, time-domain residuals for GW150914 were shown to be consistent with Gaussian noise (see Figure 1 of Abbott et al. Reference Abbott2016b), helping to show that the data are well explained by a gravitational waveform in Gaussian noise.
Recommended tests:
-
• Plot the whitened residuals in both time r(t) and frequency $|\tilde{r}(f)|$ and visually inspect for consistency with zero. In the frequency domain, the set of $\tilde{r}(f)$ are approximately independent measurements with Gaussian uncertainty equal to unity. Include residual curves for many posterior-distribution draws of $\theta$ in order to show theoretical uncertainty. A word of caution: the time-domain residuals are highly covariant in real data, and so it is less straightforward to interpret misspecification in the time domain than in the frequency domain.
-
• As a first step, look for consistency by eye. If there are signs of misspecification, one may quantify the inconsistency, for example, ‘the residuals are five standard deviations away from zero at $500\ \mathrm{Hz}$ .’ While post hoc analysis is helpful for catching egregious misspecification, in an ideal world, one should define the consistency tests a priori for unbiased tests. In practice, this is not always possible.
Demonstration: Footnote f Our signal model is a sine-Gaussian chirplet (i.e., a sine wave multiplied by a Gaussian function). We create two synthetic sets of data with Gaussian noise. The correctly specified data contains a signal that matches our model. The second dataset contains an intentionally misspecified signal: the same sine wave as before, but multiplied by a Tukey window. In both datasets, we assume Gaussian noise with a known power spectral density. In Figure 3, we plot these two simulated signals.
It is worth pausing to distinguish this discussion of model misspecification from the similar-but-different topic of model selection. If we were discussing model selection, we would keep the data fixed and compare it two different signal models. However, misspecification occurs when the analyst has not conceived of the correct model to test. Therefore, since we are discussing misspecification, we keep the model fixed and consider two hypothetical datasets: one correctly specified and one misspecified.
In the right-hand panel of Figure 4a, we display the time-domain residuals obtained when we subtract the sine-Gaussian waveform model from the misspecified data. In the left-hand panel, we show the time-domain residuals obtained from subtracting the same waveform model from the correctly specified data. If the waveform is correctly specified, the residuals should be consistent with our Gaussian noise model as in the left panel. Although it is sometimes possible to see misspecification in the time-domain data (for example, when there is a short glitch), the misspecification may be more apparent in the frequency domain as is the case here.
In Figure 4b, we plot the amplitude spectral density of the residuals. Again, the left-hand panel shows the residuals for the correctly specified signal while the right-hand panel shows the residuals for the misspecified signal. While the correct waveform yields residuals consistent with the noise amplitude spectral density, the misspecified signal produces a peak in the amplitude spectral density that is conspicuously outside of the 90% range predicted by the model (shown in pink).
In order to assess the overall goodness of fit, it can be useful to plot the cumulative density function (CDF) of the residuals alongside the theoretical CDF predicted by the model. The CDF may be constructed from residuals in the time domain (to highlight transient phenomena), but more often, misspecification is most apparent using whitened frequency-domain residuals. An illustration of the CDF test is provided in Figure 4c, where we plot the CDF of the time-domain residuals. While there is only one realisation of the data (grey), we can generate arbitrarily many realisations of the theoretical CDF (pink). The thickness of the pink CDF shows the variability from 100 different realisations.(This is similar to, although not the same as, Gelman’s posterior predictive checks (e.g., Gelman et al. Reference Gelman, Carlin, Stern, Dunson, Vehtari and Rubin2013; Gelman & Rohilla Shalizi Reference Gelman and Rohilla Shalizi2010). Gelman’s checks involve drawing realisations from the histogram of the posterior probability distribution under the assumption of the model, and checking how probable it is for the model to produce realisations that are consistent with the observed data. This is something that we return to in Section 4. Here, we draw realisations from the noise model in order to establish the range over which it can credibly vary.)
To determine if the residuals agree with the model, one can employ any of the many established hypothesis-testing tools available to determine if measured samples are drawn from the theoretical distribution. For example, the Kolmogorov-Smirnov (KS) statistic,
is the maximum difference between the measured CDF of the data and the predicted CDF given n frequency bins. (The Anderson-Darling test is also commonly used to determine if measured samples are drawn from a predicted distribution.) The $D_n$ statistic can be converted into a p-value with a look-up table that does not depend on the functional form of the CDF. The Kolmogorov-Smirnov p-values for the correctly specified data and misspecified dataset are provided in the legends of Figure 4c. The small p-value in the right panel suggests that our signal model is indeed misspecified.
Sometimes one may wish to inspect the data within a particular time or frequency interval. In this case, it can be enlightening to draw hundreds of distributions from the model, and count the number of times that the model CDF in this bin is above the CDF of the data. The fraction of draws above the CDF in this bin constitutes a p-value; if the data are correctly specified, the p-value will follow a uniform distribution. Thus, a p-value very close to 0 or 1 indicates that the data in this bin deviate significantly from the model. However, care must be taken if more than one p-value is calculated in the same set of data—while each p-value is uniformly distributed, the set of p-values from one dataset are correlated with each other.
Both of the tests described above assume that the predicted model is non-parametric. That is, the distribution of the residuals does not depend on any parameters. If the predicted distribution depends on one or more parameter $\theta$ , then we must rely on Monte Carlo methods to determine if the distributions agree. For example, we may calculate
which is a KS-type statistic using the maximum-likelihood parameters $\widehat\theta$ . Since we use the data to estimate $\widehat\theta$ , $D'_n$ is not distributed according to the Kolmogorov distribution. (It is easier to get smaller differences between the measured and predicted CDFs when the theoretical CDF varies depending on the parameters.) However, we can still calculate a p-value by generating synthetic data to empirically estimate the distribution of $D'_n$ , which amounts to a generalisation of Lilliefors test (Lilliefors Reference Lilliefors1967). We demonstrate such a calculation below in Subsection 4.2 in the context of prior misspecification; see in particular Figure 7a.
3.3. Testing for a misspecified noise model
In order to observe noise misspecification and deviations from the Whittle likelihood, it is again useful to look at the distributions of residuals. By comparing the observed residuals to the theoretical likelihood distribution, it is possible to see, for example, if five-sigma deviations are more common than expected.
Recommended tests:
-
• Create a histogram representing the probability density function of the data; or, alternatively, plot the CDF of $r,|\tilde{r}|$ , the whitened residuals. It is sometimes also useful to plot the distribution of the whitened residual power $|\tilde{r}|^2$ ; see Talbot & Thrane (Reference Talbot and Thrane2020), Chatziioannou et al. (Reference Chatziioannou, Haster, Littenberg, Farr, Ghonge, Millhouse, Clark and Cornish2019). Include residual curves for many posterior distribution draws of $\theta$ in order to show theoretical uncertainty.
-
• For fixed models with no parameters, calculate a goodness-of-fit statistic using, for example, the KS test. When models have parameters, calculate the goodness of fit for the maximum-likelihood parameters.
-
• Plot the difference in the cumulative density functions (empirical - expected) as a function of the expected CDF as in Talbot & Thrane (Reference Talbot and Thrane2020), Chatziioannou et al. (Reference Chatziioannou, Haster, Littenberg, Farr, Ghonge, Millhouse, Clark and Cornish2019). This is like a probability–probability (’PP’) plot, in which the fraction of repeated measurements is plotted against the confidence level at which the known truth value exists, for data model checking.
-
• Bootstrapping methods—in which real data is used as a sampling distribution for synthetic data—can often be helpful for diagnosing noise misspecification. In gravitational-wave astronomy, new noise realisations can be created by bootstrapping data residuals (Cannon, Hanna, & Keppel Reference Cannon, Hanna and Keppel2013; Cannon, Hanna, & Peoples Reference Cannon, Hanna and Peoples2015; Ashton, Thrane, & Smith Reference Ashton, Thrane and Smith2019) or using time slides (Dal Canton et al. Reference Dal Canton2014; Usman et al. Reference Usman2016); both methods were integral to the first gravitational-wave detections (e.g., Abbott et al. Reference Abbott2016a). However, all bootstrap methods ultimately break down due to saturation effects (Wąs et al. 2010); it is impossible to simulate all possible noise realisations using a finite dataset. In astro-particle physics, using instruments such as Super-Kamiokande, sidereal time scrambling is used to estimate typical fluctuations in signal strength due to noise (Thrane et al. Reference Thrane2009).
Demonstration: Footnote g We demonstrate how to diagnose noise misspecification. The true noise is Gaussian with a mean $\mu=0$ and standard deviation $\sigma=1$ . The misspecified noise is distributed according to the Student’s t distribution with $\nu=5$ degrees of freedom. These parameters are chosen so that the noise profiles appear, at first glance, to be consistent with each other.
We display histograms of each noise dataset in Figure 5a. The region that the model predicts 90% of the data to lie within is also shown in these plots. The histograms both appear to largely lie within this 90% credible region, with the Student’s t-distributed data only visibly deviating in the tails.
Next, we Fourier transform our datasets and compare the 90% range predicted by the noise model against histograms of the data in the frequency domain (Figure 5a). In the frequency domain, the misspecified data more clearly strays outside of the range predicted by the model. We then create a CDF of the frequency-domain data, comparing again to the 90% range predicted by the model (Figure 5b). We find a reasonable KS test p-value for the correctly specified data and an extreme p-value for the misspecified data, as expected. Finally, we compare the data to the model by plotting the data CDF against the difference between the data CDF and the model CDF (Figure 5c). In this final test, it is clear that the data is not well-represented by the model.
4. Prior misspecification
In the previous Section, we discussed various ways in which the likelihood can be misspecified and how we can detect this misspecification. Now we turn our attention to the misspecification of the prior, $\pi(\theta)$ , a distribution that describes our prior knowledge of the parameters $\theta$ .
4.1. Priors with no hyper-parameters
In some cases, we can be very confident in our priors. For example, since there is no preferred direction in the Universe, the best prior for inclination angle $\iota$ (the angle between the orbital angular momentum and the line of sight) is uniform in $\cos\iota$ . In other situations, we are not confident in the form of the prior distribution. In these cases, it can be useful to formalise this theoretical uncertainty using a conditional prior
Here, $\Lambda$ is a ‘hyper-parameter’ we may vary to alter the shape of the prior for $\theta$ . In this Subsection, we focus on priors with no hyper-parameters while we cover parameterised priors in the next Subsection.
Recommended tests:
-
• Make a CDF plot comparing the reconstructed distribution of $\theta$ to the expected distribution of $\theta$ . In order to obtain a reconstructed distribution, obtain posterior samples for N different events, each weighted with the population model to be tested. Draw one posterior sample from each of the N events to make a realisation of the reconstructed distribution. Do this many times to make many realisations. Plot CDFs of the many realisations alongside the population model being tested. If the data agrees with the model, the CDFs should overlap. If the two CDFs do not overlap, the model is a poor description of the data. It is worth noting that a model can pass this test while still being quite badly misspecified. Abbott et al. (Reference Abbott2021e) found evidence for negatively aligned spins in a population of compact-object binaries, but this was later shown to be a model-dependent feature (Roulet et al. Reference Roulet, Chia, Olsen, Dai, Venumadhav, Zackay and Zaldarriaga2021; Galaudage et al. Reference Galaudage, Talbot, Nagar, Jain, Thrane and Mandel2021). This test may be particularly unreliable if there is a sharp feature in the data that is not captured by the prior.
-
• If a more quantitative test is desired, one may calculate the KS-statistic for each CDF. If the best-fitting reconstruction has a p-value below a certain threshold, for example p-value $\leq0.00005$ (Benjamin et al. Reference Benjamin2017), then the model is not a good fit to the data.
-
• One may also identify problem areas of model under- or over-production by quantifying the discrepancies between the data draw CDFs and the model over the parameter space. This facilitates statements like ‘99% of the time, the model produces too many binary black hole events with $m_1>80\,{\rm M}_\odot$ ’.
Demonstration. Footnote h We simulate data d consisting of some physical parameter x and noise n:
The noise is Gaussian distributed with zero-mean and unit variance. The values of x are drawn from the true prior distribution: a Gaussian distribution with mean $\mu=10$ and width $\sigma=2$ . However, in order to demonstrate prior misspecification, we employ as our prior a Laplacian with the same mean and variance as the true distribution. The true prior distribution and the misspecified prior distribution are compared in Figure 6a.
We create a simulated dataset of $N=1000$ events. In each case, we construct Gaussian likelihood curves of mean $\mu_i = d_i$ . The posterior for each value of $x_i$ is proportional to the likelihood multiplied by the prior. We draw 50 posterior samples from each of these simulated posteriors in order to create 50 CDF curves; see Figure 6b. We make two versions of this plot: one with the correctly specified prior and one with the misspecified prior.
It is also sometimes useful to plot the data CDF minus the model CDF. We include an example of this plot in Figure 6c. In this case, we see that the misspecified prior yields disagreements in the CDF in the distribution tails; the difference between pink model and grey data is, in general, inconsistent with zero when the model CDF is ${\approx}0.1$ and ${\approx}0.9$ .
4.2. Hyper-parameterised priors
We are particularly concerned with population models (sometimes called ‘hierarchical models’), where the prior for the parameters $\theta$ is conditional on a set of hyper-parameters $\Lambda$ , describing the shape of the prior distribution:
While data models are built on assumptions about the nature of noise, population models are built on assumptions about astrophysics. For example, if we assume that the distribution of primary black hole mass $m_1$ follows a power-law distribution with spectral index $\alpha$ ,
then $\alpha\in\Lambda$ is a hyper-parameter for the distribution of $m_1 \in \theta$ .
Hyper-parameters are frequently a convenient way of describing systematic theoretical uncertainty. For example, consider a parameter x drawn from a Gaussian distribution ${\mathcal{N}}(x|\mu,\sigma)$ with mean $\mu$ and width $\sigma$ . If we do not know the precise value of $\mu$ or $\sigma$ , our misspecification tests should take this uncertainty into account. We repeat the tests from the previous Subsection for the case of a hyper-parameterised prior.
Here is a summary of the differences that arise from the addition of a hyper-parameter. One must generate random draws of the hyper-parameter $\Lambda$ . Then generate draws of the model CDF for each random draw of $\Lambda$ , and weight posterior samples using the population predictive distribution—the conditional prior, marginalised over uncertainty in the hyper-parameters:
Here $p(\Lambda | d)$ is the posterior for the hyper-parameters. Take care not to double-count; the posterior for $\Lambda$ used to reweight event k should not be informed by event k; see, for example, Essick et al. (Reference Essick, Farah, Galaudage, Talbot, Fishbach, Thrane and Holz2022).
Demonstration: Footnote i We assume a Gaussian prior for parameter x with uncertain $\mu$ and width $\sigma$ . We simulate $N=1000$ true events, $x_\mathrm{true}$ , from two populations (priors): one Gaussian-distributed and one Laplacian-distributed, both with the same means $\mu_\mathrm{G}$ , and with widths $\sigma_\mathrm{G}$ and $\sigma_\mathrm{L}$ , respectively. For each population, we calculate the maximum-likelihood detected value of each event, $x_\mathrm{meas}$ by offsetting it by a random number drawn from a Gaussian of the same width as the likelihood distribution, $\sigma_\mathrm{meas}$ . The mean average of these maximum-likelihood values is the maximum-likelihood estimate for the population prior mean, $\mu_\mathrm{E}$ , and the measured population prior width is $\sigma_\mathrm{E}=\sigma_\mathrm{meas} / \sqrt{N}$ . The estimated mean of the population posterior is described by
since $\mu = 0$ . The estimated width is
The width of the posterior predictive distribution is
the width of a cross product of the uncertainty distribution of width $\sigma$ and the posterior width $\sigma_\mathrm{P}$ . The population-weighted width for each individual event’s posterior is
We draw $M=100$ samples from each of these posteriors of mean $x_\mathrm{meas}$ and width $\sigma_\mathrm{x}$ . In practice, we do this by drawing N samples M times from the population posterior. The CDFs of these draws can then be compared to the model CDF.
Since we have used the data to estimate the hyper-parameters $(\mu, \sigma)$ , we cannot use the standard KS test to ascertain the degree of misspecification. Instead, we must define our own KS-like test for the p-value using the distance between the model and data. We calculate 100 KS distances between the model and draws from the model. We can histogram this data to show the distribution of KS distances expected if the model is a good fit to the data; see the pink histogram in Figure 7a. We measure the KS distance between the average data realisation and the data-conditioned model, which is shown by a grey line in Figure 7a. The p-value equates to the fraction of the model KS distance distribution above the data KS distance. In Figure 7b, we plot the 90% credible bands of the parameterised model and compare against the 90% credible range of the data. Despite being conditioned on the same data, the misspecified model strays outside of the credible band of the data and achieves a p-value of only 0.01. Finally, in Figure, we show the same CDFs with the CDF of the model subtracted, and plot against the model CDF. This has the effect of emphasising the extent of the mismatch between the data and the misspecified model, while the well-specified model contains the median data draw over the entire range.
4.3. Models with more than one dimension
It is difficult to look for model misspecification in more than one or two dimensions. Tests like those described in this Article so far can be extended, but the tell-tale signs of misspecification (e.g., detectable structure in residuals) can be difficult to see in large-dimensional spaces. However, there are some tools available.
-
• Make two-dimensional scatter plots. If the model does not have any parameters, it can be represented by contours while the data can be represented with two-dimensional error bars or credible intervals.
-
• If the model is subject to theoretical uncertainty (i.e., it is described by hyper-parameters), it may be necessary to draw multiple contours. However, the plot can be difficult to read if there are more than two variables.
-
• There is no standardised test for goodness of fit in two or more dimensions. Boutique tests designed to identify particular forms of misspecification can work well, but one must beware trial-factor penalties when designing the test after looking at the data.
5. More symptoms of model misspecification
5.1. Posterior stability
In the previous Sections we describe tests to find misspecification. However, misspecification is sometimes manifest from surprising results. One such example is a phenomenon we refer to as ‘posterior instability.’ Let us consider the posterior for some parameter (call it $\theta$ ) and imagine how this posterior changes as we accumulate data. On average, we expect the posterior to narrow as we include more information. It also tends to shift around, but only within the bounds of the previous credible regions. It would be surprising if the addition of data produced a posterior favouring $\theta=3$ when an earlier posterior (calculated with less data) disfavoured $\theta=3$ with high credibility.
In such cases we say that the posterior is not stable to the addition of data. This can be indicative of model misspecification. An example from gravitational-wave astronomy is the maximum black hole mass parameter, which proved to be unstable moving from GWTC-1 (Abbott et al. Reference Abbott2019b) to GWTC-2 (Abbott et al. Reference Abbott2020a), under the assumption that the primary black hole mass distribution is a power law with a sharp cut-off. When the mass model was improved to allow for additional features (deviations from a power law Talbot & Thrane Reference Talbot and Thrane2018), the maximum-mass parameter stabilised.Footnote j For another example of posterior instability from optical astronomy, see Liu et al. (Reference Liu, Gezari and Miller2018), Zhu & Thrane (Reference Zhu and Thrane2020).
5.2. Outliers
Outliers are symptom of misspecification closely related to posterior stability (Fishbach, Farr, & Holz Reference Fishbach, Farr and Holz2020; Essick et al. Reference Essick, Farah, Galaudage, Talbot, Fishbach, Thrane and Holz2022). An outlier is an event with a parameter value appearing inconsistent with the rest of the distribution—in the context of a particular population model. Let us imagine that the distribution of events in our dataset (characterised by parameter $\theta$ ) seem well-described by a normal prior with mean zero and unit variance: ${\mathcal{N}}(\theta|\mu=0, \sigma=1$ . If we subsequently observed an event with $\theta \gtrsim 10$ , this could indicate that our prior model (a normal distribution) is inadequate.
Commonly, population outliers are identified using a ‘leave-one-out’ analysis method that compares an inferred distribution with and without the potentially anomalous data point; see Fishbach et al. (Reference Fishbach, Farr and Holz2020). However, care must be taken to take into account trial factorsFootnote k since, in any catalogue, some event has to be the most extreme. The statistics required for a careful leave-one-out analysis are too complicated for us to summarise here. Instead we refer the reader to Essick et al. (Reference Essick, Farah, Galaudage, Talbot, Fishbach, Thrane and Holz2022), which describes a ‘coarse-graining’ method to identify outliers.
6. Discussion: Living with misspecification
Misspecified models can lead to flawed inferences. While it is not always practical, models should, when possible, be subjected to an array of visualisations and checks for misspecification. In an ideal world, models found to be misspecified should be improved so that they pass these checks. In particular, one may refine one’s model in order to better capture features of the data (e.g., Gao & Ho Reference Gao and Ho2017; Gabry et al. Reference Gabry, Simpson, Vehtari, Betancourt and Gelman2017), although this kind of post hoc data fitting can cause obvious biases in inferred results. Iteratively adding complexity can also cause computational costs to skyrocket, and each addition has the potential to add further misspecification.
Thus, in practice, model refinement is not always possible. For example, as data becomes more informative at higher signal-to-noise ratios (SNR), even subtle imperfections can lead to signs of misspecification. For example, if one attempts to fit a template to an optical image of a distant (but clearly resolved) galaxy, there will always be non-negligible residuals because our best templates are no match for the high signal-to-noise ratio of optical astronomy data.
One option in such cases is to apply coarsening to blur the data. For example, one may employ a coarsened posterior, conditional on the distance between the model and the data being below some threshold, but not zero (see Miller & Dunson Reference Miller and Dunson2019, for a detailed description of posterior coarsening). An alternative form of coarsening is to add a non-parameteric error term to the revised model to account for unknown or independent influences in the data; for example, Bhatt et al. (Reference Bhatt, Cameron, Flaxman, Weiss, Smith and Gething2017) use a Gaussian random field to generate multiplicative factors to the terms in their parameterised model for malaria mapping.
While the low-SNR regime has its own inference challenges, it can be more forgiving when confronted with misspecification. For example, the inclusion of nuisance parameters has helped to understand high-energy gamma-ray emission from the galactic centre (Storm, Weniger, & Calore Reference Storm, Weniger and Calore2017; Armand & Calore Reference Armand and Calore2021; Bartels et al. Reference Bartels, Storm, Weniger and Calore2018), resolve temporally varying structure in the brightness profile of the accretion disc around a supermassive black hole (Arras et al. Reference Arras, Frank, Haim, Knollmüller, Leike, Reinecke and Enßlin2022) and to assess the existence of a correlation between COVID-19 viral load and age (Guardiani et al. Reference Guardiani, Frank, KostiĆ, Edenhofer, Roth, Uhlmann and Enßlin2021).
Although we have described a number of ways in which models can be tested for misspecification, one must ultimately accept that all physical models are—to some degree—misspecified (Box Reference Box1976). The question, therefore, is not whether a model is wrong, but whether it is adequate. If a model does a ‘good enough’ job of describing a signal (that is, it is not obviously misspecified), then we may still able to use it to inform us about the Universe.
Acknowledgments
The authors would like to thank those who took part in the discussion during the When is a model good? session at the 2021 OzGrav Annual Retreat, in particular Rory Smith, who co-organised and chaired the session. We would also like to thank Ewan Cameron for his comments on the manuscript. Additionally, we are grateful to our anonymous reviewer, who provided many useful insights and suggestions. This work is supported through Australian Research Council (ARC) Future Fellowship FT160100112, Centre of Excellence CE170100004, and Discovery Project DP180103155.