Hostname: page-component-cd9895bd7-jkksz Total loading time: 0 Render date: 2024-12-25T18:05:32.699Z Has data issue: false hasContentIssue false

Relaxation of the parameter independence assumption in the bootComb R package

Subject: Mathematics, Statistics and Probability

Published online by Cambridge University Press:  13 September 2022

Marc Y. R. Henrion*
Affiliation:
Malawi Liverpool Wellcome Programme, Statistical Support Unit, Blantyre, Malawi Liverpool School of Tropical Medicine, Liverpool, United Kingdom
*
*Corresponding author. Email: [email protected]

Abstract

Background

The bootComb R package allows researchers to derive confidence intervals with correct target coverage for arbitrary combinations of arbitrary numbers of independently estimated parameters. Previous versions (<1.1.0) of bootComb used independent bootstrap sampling and required that the parameters themselves are independent—an unrealistic assumption in some real-world applications.

Findings

Using Gaussian copulas to define the dependence between parameters, the bootComb package has been extended to allow for dependent parameters.

Implications

The updated bootComb package can now handle cases of dependent parameters, with users specifying a correlation matrix defining the dependence structure. While in practice it may be difficult to know the exact dependence structure between parameters, bootComb allows running sensitivity analyses to assess the impact of parameter dependence on the resulting confidence interval for the combined parameter.

Availability

bootComb is available from the Comprehensive R Archive Network (https://CRAN.R-project.org/package=bootComb).

Type
Research Article
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Introduction

The bootcomb R package (Henrion, Reference Henrion2021, Reference Henrion2022) was recently published. This package for the statistical computing environment R (R Core Team, 2021) allows researchers to derive confidence intervals (CIs) with correct coverage for combinations of independently estimated parameters. Important applications include adjusting a prevalence for estimated test sensitivity and specificity (e.g., Mandolo et al., Reference Mandolo, Henrion, Mhango, Chinyama, Wachepa, Kanjerwa, Malamba-Banda, Shawa, Hungerford, Kamng’ona, Iturriza-Gomara, Cunliffe and Jere2021) or combining conditional prevalence estimates (e.g., Stockdale et al., Reference Stockdale, Kreuels, Henrion, Giorgi, Kyomuhangi, de Martel, Hutin and Geretti2020).

Briefly, for each of the input parameters, bootComb finds a best-fit parametric distribution based on the confidence interval for that parameter estimate. bootComb then uses the parametric bootstrap to sample many sets of parameter estimates from these best-fit distributions and computes the corresponding combined parameter estimate for each set. This builds up an empirical distribution of parameter estimates for the combined parameter. Finally, bootComb uses either the percentile or the highest density interval method to derive a confidence interval for the combined parameter estimate. Full details of the algorithm are given by Henrion (Reference Henrion2021).

A key point of the algorithm is that the best-fit distributions for the different parameters are sampled independently. This requires the parameters to be independent. This may not be a realistic assumption in some real-world applications.

While for most practical applications the input parameters are typically estimated from independent experiments (otherwise the combined parameter could be directly estimated), the parameters themselves may not be independent. This is for instance the case when adjusting a prevalence for the diagnostic test’s sensitivity and specificity. The latter two parameters are not independent: higher sensitivity can be achieved by lowering specificity and vice versa.

If the experiments estimating these parameters are sufficiently large, then the violation of the assumption of parameter independence may only have a negligible impact on the resulting confidence interval for the combined parameter. However, for the sake of general applicability and to allow running sensitivity analyses, the author felt it was beneficial to extend bootComb to handle dependent parameters.

Methods

Copulas are multivariate distribution functions where the marginal probability distribution of each variable is the uniform distribution on the interval $ \left[0,1\right] $ . Copulas allow to specify the intercorrelation between random variables. An important probability theory result, Sklar’s Theorem (Sklar, Reference Sklar1959), states that any multivariate probability distribution can be expressed in terms of its univariate marginal distributions and a copula defining the dependence between the variables.

Mathematically, let $ {X}_1,\hskip0.35em {X}_2\dots, \hskip0.35em {X}_d $ be $ d $ random variables and define $ {U}_i={F}_i\left({X}_i\right),\hskip0.35em i=1,\dots, \hskip0.35em d $ . Then the copula $ C $ of $ \left({X}_1,\dots, {X}_d\right) $ is defined as the joint cumulative distribution function of $ \left({U}_1,\dots, {U}_d\right) $ :

(1) $$ C\left({u}_1,\dots, {u}_d\right)=\mathit{\Pr}\left({U}_1\le {u}_1,\dots, {U}_d\le {u}_d\right) $$

Assume that the marginal distributions, $ {F}_i(x)=\mathit{\Pr}\left[{X}_i\le x\right],\hskip0.35em i=1,\dots, \hskip0.35em d $ are continuous. Then, via the probability integral transform (Angus, Reference Angus1994), the random vector $ \left({U}_1,{U}_2,\dots, {U}_d\right) $ has marginals that are uniformly distributed on $ \left[0,1\right] $ .

bootComb makes use of the fact that the above can be reversed: given a sample $ \left({u}_1,\dots, {u}_d\right) $ , a sample for $ \left({X}_1,\dots, {X}_d\right) $ can be obtained by $ \left({x}_1,\dots, {x}_d\right)=\left({F}_1^{-1}\left({u}_1\right),\dots, {F}_d^{-1}\left({u}_d\right)\right) $ . The inverse functions $ {F}_i^{-1}(u) $ will be defined if the marginals $ {F}_i(x) $ are continuous. For the use of bootComb, where users input confidence intervals for an estimated numeric parameter, this will always be the case.

bootComb will proceed as follows to generate samples from a multivariate distribution of $ d $ dependent variables:

  • Estimate best-fit distributions $ {F}_1,\dots, \hskip0.35em {F}_d $ for each of the $ d $ parameters $ {X}_1,\dots, \hskip0.35em {X}_d $ given the lower and upper limits of the estimated confidence intervals for each parameter.

  • Sample $ \left({z}_1,\dots, {z}_d\right) $ from a multivariate normal distribution $ \mathcal{N}\left(0,\varSigma \right) $ , where the variances in $ \varSigma $ are all 1.

  • Since the marginals of this normal distribution are all $ \mathcal{N}\left(0,1\right) $ , compute $ {u}_i=\varPhi \left({z}_i\right) $ where $ \varPhi $ is the cumulative distribution function of the standard normal.

  • Finally, for each $ i=1,\dots, \hskip0.35em d $ , compute $ {x}_i={F}_i^{-1}\left({u}_i\right) $ where $ {F}_i $ is the best-fit marginal distribution of parameter $ i $ .

The resulting vector $ \left({x}_1,\dots, {x}_d\right) $ will be a sample from the multivariate distribution of $ \left({X}_1,\dots, {X}_d\right) $ . Note that the dependence structure was completely specified through the covariance matrix $ \varSigma $ (since the variances are assumed to be 1, this really is a correlation matrix) and marginal distributions for each parameter were specified by $ {F}_i,i=1,\dots, \hskip0.35em d $ .

Results

I repeat the two examples from Henrion (Reference Henrion2021) here, but look at the effect of specifying a dependence between the input parameters.

All examples below use the highest density interval (HDI) method (input argument method = “hdi”) to derive the final confidence interval. Whether this or the percentile method is used is a user choice. The HDI-derived interval will be the narrowest interval with the desired coverage and the probability density will always be higher within that interval than outside it. To note however that the HDI may not be a single interval but a set of intervals if the density is multimodal. In this case, the single interval returned by bootComb will be too wide. For this reason, users should always inspect the histogram of the sampled combined parameter when using the HDI method.

HDV prevalence in the general population

With an application to hepatitis D and B viruses (HDV and HBV, respectively) from Henrion (Reference Henrion2021), Stockdale et al. (Reference Stockdale, Kreuels, Henrion, Giorgi, Kyomuhangi, de Martel, Hutin and Geretti2020) showed how to use bootComb to obtain a valid confidence interval for $ {\hat{p}}_{aHDV} $ , the prevalence of HDV-specific immunoglobulin G antibodies (anti-HDV) in the general population.

HBV is a pre-condition for HDV and hence to derive $ {\hat{p}}_{aHDV} $ Stockdale et al. (Reference Stockdale, Kreuels, Henrion, Giorgi, Kyomuhangi, de Martel, Hutin and Geretti2020), obtained estimates of the prevalence of surface antigen of the hepatitis B virus (HBsAg), $ {\hat{p}}_{HBsAg}=3.5\% $ , and the conditional prevalence of anti-HDV given the presence of HBsAg, $ {\hat{p}}_{aHDV\mid HBsAg}= $ 4.5%:

  • $ {\hat{p}}_{HBsAg}=3.5 $ with 95% CI $ (2.7\%,\hskip2pt 5.0\%) $ .

  • $ {\hat{p}}_{aHDV\mid HBsAg}=4.5 $ with 95% CI $ (3.6\%,\hskip2pt 5.7\%) $ .

Assuming these two parameters to be independent, Henrion (Reference Henrion2021) derived a 95% confidence interval for the estimate $ {\hat{p}}_{aHDV}={\hat{p}}_{aHDV\mid HBsAg}\cdotp {\hat{p}}_{HBsAg} $ using bootComb $ (0.11\%,\hskip2pt 0.25\%) $ .

If, however, the two input prevalences are not independent, for example, if anti-HDV is more common among people with the presence of HBsAg the higher the population prevalence of HBsAg is, then that assumption of independence would not hold. We can investigate how strong an effect dependence of the parameters can have on the resulting confidence estimate. For example, let us run the same example using bootComb with specifying the following covariance matrix for the bivariate normal copula:

(2) $$ \varSigma =\left(\begin{array}{rr}1\hskip1.5em &\;0.5\\ {}0.5\;& \hskip1em 1\end{array}\right) $$

This yields the 95% confidence interval $ (0.10\%,\hskip2pt 0.26\%) $ , a slightly wider interval—which makes sense, as the positive correlation means it is more likely for pairs of bootstrapped input parameters to be both near the upper (respectively lower) end of their confidence intervals.

For this particular application, a dependence between both prevalence parameters, $ {\hat{p}}_{HBsAg} $ and $ {\hat{p}}_{aHDV\mid HBsAg} $ , is unlikely and I have therefore not considered this example any further.

SARS-CoV-2 seroprevalence adjusted for test sensitivity and specificity

Henrion (Reference Henrion2021) gave an example of adjusting an estimated SARS-CoV-2 seroprevalence for the estimated sensitivity and specificity of the test assay. Specifically,

  • Eighty-four out of 500 study participants tested positive for SARS-CoV-2 antibodies, yielding a seroprevalence estimate $ {\hat{\pi}}_{raw}=16.8\% $ with exact binomial 95% CI $ (13.6\%,\hskip2pt 20.4\%) $ .

  • Estimated assay sensitivity: 238 out of 270 known positive samples tested positive $ {\hat{p}}_{sens}=88.1\% $ , 95% CI $ (83.7\%,\hskip2pt 91.8\%) $ .

  • Estimated assay specificity: 82 out of 88 known negative samples tested negative $ {\hat{p}}_{spec}=93.2\% $ , 95% CI $ (85.7\%,\hskip2pt 97.5\%) $ .

Assuming the sensitivity and specificity to be independent, Henrion (Reference Henrion2021) reported an adjusted seroprevalence estimate $ \hat{\pi}=12.3\% $ with 95% CI $ (3.9\%,\hskip2pt 19.0\%) $ .

However, in this case, the assumption of independence is not fully realistic: there is a trade-off between sensitivity and specificity of the test assay, and as such one would expect a negative dependence between the two parameters: sensitivity can be increased at the cost of decreased specificity and vice versa.

Assuming that the sensitivity and specificity are negatively correlated with the copula correlation parameter $ \rho =-0.5 $ between these two parameters, using the extension of bootComb we can now account for the dependence of the parameters:

The reported confidence interval is now $ (3.8\%,\hskip2pt 19.4\%) $ —marginally wider than when the dependence was ignored.

If we additionally specify returnBootVals = TRUE in the function call, we can extract and plot the sampled pairs of sensitivity and specificity values to check the dependence structure. This is shown in Figure 1: as the correlation parameter $ \rho $ in the copula between the sensitivity and specificity is decreased from 0 to −1, the dependence between both parameters becomes more and more pronounced as one would expect.

Figure 1. Scatterplots showing the bootstrapped values of sensitivity and specificity for different strengths of dependence (from independence to perfect correlation) between sensitivity and specificity. The empirical kernel density estimate for the bivariate distribution in each case is shown as orange contour lines.

This shows that a simple correlation matrix specified for the Gaussian copula results in this case in a non-trivial dependence structure between two beta-distributed variables, respecting the specified marginal distributions.

We can also visualize the effect on the estimated confidence interval, as shown in Figure 2. We can see that in this case, with a negative correlation, the width of the CI increases at the correlation becomes stronger. However, looking at the scale of the y-axis we see that this is largely just a marginal effect.

Figure 2. Width of the estimated confidence interval as a function of increased strength of the negative correlation between sensitivity and specificity.

A more substantial effect of parameter dependence is obtained when we also allow the measured prevalence $ {\pi}_{raw} $ to be correlated with sensitivity ( $ {p}_{sens} $ ; positive correlation) and specificity ( $ {p}_{spec} $ ; negative correlation). Specifically, we can specify the following correlation matrix for the parameters $ \left({\pi}_{raw},{p}_{sens},{p}_{spec}\right) $ :

(3) $$ \varSigma =\left(\begin{array}{rrr}1\hskip1.5em &\;0.3& -0.3\\ {}0.3\;& \hskip1em 1& -0.5\\ {}-0.3& -0.5& 1\end{array}\right) $$

In this case, the reported confidence interval is $ (4.7\%,\hskip2pt 18.2\%) $ . This CI is, in relative terms, 11% narrower than when the dependence structure was ignored—a substantial effect for practical purposes.

Conclusions

The R package bootComb has been extended and, using Gaussian copulas, it can now handle the case of dependent input parameters. For many applications, the effect of dependence between the parameters will be marginal or even negligible, but this is not always the case. The package now allows users to do sensitivity analyses to assess the effects of a miss-specified dependence structure between the parameters that are being combined.

At the time of publication, the most recent version of bootComb was 1.1.2.

Data availability statement

All data to support this work are contained within the article. The software package itself is available from https://cran.r-project.org/package=bootComb.

Funding statement

This research was funded in whole, or in part, by the Wellcome Trust [grant: 206545/Z/17/Z]. For the purpose of open access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Conflict of interest

The author has no conflicts of interest to declare.

References

Angus, J. E. (1994). The probability integral transform and related results. SIAM Review, 36, 652654.CrossRefGoogle Scholar
Henrion, M. Y. (2021). bootComb—an R package to derive confidence intervals for combinations of independent parameter estimates. International Journal of Epidemiology, 50, 10711076. https://doi.org/10.1093/ije/dyab049 accessed: 23rd August 2022CrossRefGoogle Scholar
Henrion, M. Y. (2022). bootComb: Combine Parameter Estimates via Parametric Bootstrap (R package version 1.1.2) [Computer software]. https://cran.r-project.org/package=bootCombGoogle Scholar
Mandolo, J. J., Henrion, M. Y. R., Mhango, C., Chinyama, E., Wachepa, R., Kanjerwa, O., Malamba-Banda, C., Shawa, I. T., Hungerford, D., Kamng’ona, A. W., Iturriza-Gomara, M., Cunliffe, N. A., & Jere, K. C. (2021). Reduction in severity of all-cause gastroenteritis requiring hospitalisation in children vaccinated against rotavirus in Malawi. Viruses, 13, 2491. https://doi.org/10.3390/v13122491CrossRefGoogle ScholarPubMed
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/ accessed: 23rd August 2022Google Scholar
Sklar, M. (1959). Fonctions de Répartition À N Dimensions Et Leurs Marges (pp. 229–231). Publications de l’Institut Statistique de l’Université de Paris.Google Scholar
Stockdale, A. J., Kreuels, B., Henrion, M. Y. R., Giorgi, E., Kyomuhangi, I., de Martel, C., Hutin, Y., & Geretti, A. M. (2020). The global prevalence of hepatitis D virus infection: Systematic review and meta-analysis. Journal of Hepatology, 73, 523532. https://doi.org/10.1016/j.jhep.2020.04.008CrossRefGoogle ScholarPubMed
Figure 0

Figure 1. Scatterplots showing the bootstrapped values of sensitivity and specificity for different strengths of dependence (from independence to perfect correlation) between sensitivity and specificity. The empirical kernel density estimate for the bivariate distribution in each case is shown as orange contour lines.

Figure 1

Figure 2. Width of the estimated confidence interval as a function of increased strength of the negative correlation between sensitivity and specificity.

Reviewing editor:  Charles Doran University of Alberta, Mathematical and Statistical Sciences, Edmonton, Canada, T6G 2R3
Minor revisions requested.

Review 1: Relaxation of the parameter independence assumption in the bootComb R package

Conflict of interest statement

Reviewer declares none.

Comments

Comments to the Author: The value the author gives to the relaxation of the parameter independence assumption in the bootComb R package is clearly expressed, firstly for the sake of completeness and secondly for the sake of integrity. However, for many applications, the effect on confidence intervals is ‘marginal or even negligible’ (195, 196). It is understood that, ‘in practice it may be difficult to know the exact dependence structure between parameters’ (34); but could perhaps an example of a case where the corrected confidence interval differs substantially from the uncorrected be included in the results to demonstrate the potential applied importance of this work?

I think it would be helpful to the reader if the circumstances under which the percentile or highest density interval method would be used was made briefly explicit (58, 59, 60) despite this already being stated in Henrion (2021).

I found a small number of potential typing errors in the text. For example, ‘covariances are assumed to be one’ not ‘variances’ (106), ‘let ’s ’ instead of ‘let us ’ (129) and ‘we ’ instead of ‘I ’ (149).

Thank you, a very exciting paper.

Presentation

Overall score 4.6 out of 5
Is the article written in clear and proper English? (30%)
5 out of 5
Is the data presented in the most useful manner? (40%)
4 out of 5
Does the paper cite relevant and related articles appropriately? (30%)
5 out of 5

Context

Overall score 4.8 out of 5
Does the title suitably represent the article? (25%)
5 out of 5
Does the abstract correctly embody the content of the article? (25%)
5 out of 5
Does the introduction give appropriate context? (25%)
4 out of 5
Is the objective of the experiment clearly defined? (25%)
5 out of 5

Analysis

Overall score 4.6 out of 5
Does the discussion adequately interpret the results presented? (40%)
4 out of 5
Is the conclusion consistent with the results and discussion? (40%)
5 out of 5
Are the limitations of the experiment as well as the contributions of the experiment clearly outlined? (20%)
5 out of 5

Review 2: Relaxation of the parameter independence assumption in the bootComb R package

Conflict of interest statement

Reviewer declares none.

Comments

Comments to the Author: Paper was clear and straightforward. The sensitivity analysis seemed to demonstrate dependence of the parameters had only very minor effects on the CI width. It would be nice to see an example where this is less negligible.

Presentation

Overall score 4.2 out of 5
Is the article written in clear and proper English? (30%)
5 out of 5
Is the data presented in the most useful manner? (40%)
3 out of 5
Does the paper cite relevant and related articles appropriately? (30%)
5 out of 5

Context

Overall score 5 out of 5
Does the title suitably represent the article? (25%)
5 out of 5
Does the abstract correctly embody the content of the article? (25%)
5 out of 5
Does the introduction give appropriate context? (25%)
5 out of 5
Is the objective of the experiment clearly defined? (25%)
5 out of 5

Analysis

Overall score 4.6 out of 5
Does the discussion adequately interpret the results presented? (40%)
4 out of 5
Is the conclusion consistent with the results and discussion? (40%)
5 out of 5
Are the limitations of the experiment as well as the contributions of the experiment clearly outlined? (20%)
5 out of 5