Introduction
Statistical hypothesis testing is an essential approach in medical and healthcare data analysis [Reference Guyatt1]. Student's t test is one of the most widely used tests for statistical inference on normally (or approximately normally) distributed datasets, or on datasets with a sufficiently large sample size for the central limit theorem (CLT) to apply [Reference Bland2, Reference Kirkwood and Sterne3]. Student's t test may yield unsatisfactory testing outcomes when samples are skewed [Reference Wilcox4], most likely with small sample sizes. Bootstrap methods were proposed in the 1970s and have been used to analyse data that are not normally distributed [Reference Efron, Kotz and Johnson5, Reference Dwivedi, Mallawaarachchi and Alvarado6]. They are (asymptotically) more accurate than the standard estimates that use the sample variance and rely on the assumption of normality [Reference DiCiccio and Efron7, Reference Carpenter and Bithell8]. A bootstrap t test, which considers the percentile of the bootstrapped test statistic samples at the significance level, was proposed by Efron and Tibshirani in 1993 [Reference Efron and Tibshirani9]. To avoid repetition, we omit the algorithm of the bootstrap t test in this study, since it is described in detail in [Reference Efron and Tibshirani9]. Although it is an improved version of the t test, the bootstrap t test is seldom adopted in medical research.
Objectives
As mentioned, Student's t test is commonly accepted when the data are normally distributed, whereas the bootstrap approach can be adopted when normality does not hold. In this study, we demonstrated that, for data from normal populations, the bootstrap t test outperforms Student's t test on several measures of testing accuracy. We also explored the general features of data samples for which the bootstrap t test is likely to yield a more plausible testing outcome.
Methods
The details of the testing procedure of the bootstrap t test can be found in [Reference Efron and Tibshirani9]. The pairwise two-sample t tests are conducted under the null hypothesis, H0, that the means of the two populations are equal. Data samples are randomly generated from normally distributed populations, so that the testing outcomes based on the samples can be compared with the known facts about the populations. We evaluated the testing performance in two scenarios:
• scenario (i): the H0 is true; and
• scenario (ii): the H0 is false.
Then, the probability that H0 is not rejected in scenario (i) is the true-positive rate (TPR), i.e. sensitivity. The probability that H0 is rejected in scenario (ii) is the true-negative rate (TNR), i.e. specificity. Theoretically, the TPR is (1 − α), where α is the rate of the type I error, i.e. the false-alarm rate, and similarly, the TNR is (1 − β), where β is the rate of the type II error, i.e. the miss rate. It is common practice to set α at 5%, and the test is formulated with the aim of minimising β [Reference Guyatt1, Reference Anderson, Burnham and Thompson10].
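The two scenarios and the resulting rates can be sketched as a small Monte Carlo simulation. The Python code below is a minimal illustration for Student's t test only, not the code used in this study: the function name, the unit mean difference, the equal group sizes and the two-sided test are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def estimate_tpr_tnr(n, cv, alpha=0.05, n_pairs=10_000, seed=0):
    """Monte Carlo estimate of (TPR, TNR) for Student's two-sample t test.

    TPR: share of sample pairs for which H0 is NOT rejected when H0 is true.
    TNR: share of sample pairs for which H0 IS rejected when H0 is false.
    """
    rng = np.random.default_rng(seed)
    diff = 1.0       # assumed difference in means when H0 is false
    sd = cv * diff   # from CV = s.d. / difference in mean

    # Scenario (i): H0 is true, both populations share the same mean.
    x0 = rng.normal(0.0, sd, size=(n_pairs, n))
    y0 = rng.normal(0.0, sd, size=(n_pairs, n))
    p0 = stats.ttest_ind(x0, y0, axis=1).pvalue
    tpr = float(np.mean(p0 >= alpha))

    # Scenario (ii): H0 is false, the population means differ by `diff`.
    x1 = rng.normal(0.0, sd, size=(n_pairs, n))
    y1 = rng.normal(diff, sd, size=(n_pairs, n))
    p1 = stats.ttest_ind(x1, y1, axis=1).pvalue
    tnr = float(np.mean(p1 < alpha))

    return tpr, tnr


tpr, tnr = estimate_tpr_tnr(n=20, cv=1.0)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")
```

With α = 5%, the estimated TPR should stay close to 95%, whereas the TNR depends on the sample size and the CV.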
Fixed TPR
With TPR = (1 − α) = 95%, i.e. α = 5%, we evaluated
(i) the consistency in TPR,
(ii) the levels of TNR and
(iii) informedness (i.e. Youden's J statistic)
of the two types of t test with varying sample size and coefficient of variation (CV), defined as CV = s.d./difference in mean in the samples [Reference Schisterman11]. Here, the informedness = TPR + TNR − 1, ranging from 0 to 1 (inclusive), is a single statistic that estimates the probability of an informed decision [Reference Youden12], and it evaluates the performance of diagnostic tests. The informedness is 0 when a diagnostic test gives the same proportion of positive results for both true and false groups, which implies that the testing outcome is totally uninformed. An informedness of 1 indicates the ideal situation in which TPR = TNR = 1, which implies that the testing outcome is perfectly informed. Since the test statistic of the t test is mainly determined by the CV and the sample size, these two factors are included in the testing performance evaluation.
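The two quantities defined above can be written out directly. The following Python sketch is illustrative only: the function names are hypothetical, and taking the absolute value of the difference in mean is an assumption.

```python
def coefficient_of_variation(sd, mean_difference):
    """CV as defined in this study: s.d. divided by the difference in mean."""
    # Assumption: the sign of the difference is irrelevant, so use its magnitude.
    return sd / abs(mean_difference)


def informedness(tpr, tnr):
    """Youden's J statistic: TPR + TNR - 1."""
    return tpr + tnr - 1.0


# A perfectly informed test has TPR = TNR = 1.
print(informedness(1.0, 1.0))
# A test with TPR = 1 - TNR is uninformed, so the result is (numerically) zero.
print(informedness(0.95, 0.05))
```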
Varying TPR
With varying TPR, i.e. (1 − α), we could measure the diagnostic performance of both tests by using TPR and TNR in pairs. With all pairs of TPR and TNR, we could construct the receiver operating characteristic (ROC) curve to illustrate the diagnostic abilities of the two t tests in terms of the area under the curve (AUC).
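The construction of the ROC curve from pairs of TPR and TNR can be sketched as follows. This Python example is a minimal illustration under stated assumptions: it takes the P values obtained when H0 is true and when H0 is false as inputs, sweeps α over an evenly spaced grid, and integrates with the trapezoidal rule. The function name and the grid are hypothetical choices, not the procedure used in this study.

```python
import numpy as np


def roc_auc_from_pvalues(p_h0_true, p_h0_false, n_alpha=1001):
    """AUC of the ROC curve traced by sweeping the significance level alpha.

    TPR(alpha) = share of P values >= alpha when H0 is true (H0 not rejected).
    TNR(alpha) = share of P values <  alpha when H0 is false (H0 rejected).
    The curve plots TPR against 1 - TNR.
    """
    p0 = np.asarray(p_h0_true, dtype=float)
    p1 = np.asarray(p_h0_false, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_alpha)

    tpr = np.array([np.mean(p0 >= a) for a in alphas])
    tnr = np.array([np.mean(p1 < a) for a in alphas])
    fpr = 1.0 - tnr

    # Both rates are non-increasing in alpha, so reversing the arrays
    # orders the curve by increasing 1 - TNR.
    fpr, tpr = fpr[::-1], tpr[::-1]

    # Trapezoidal rule over the ROC points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# P values that separate the two scenarios perfectly give an AUC of 1,
# whereas identical P value samples give the diagonal, i.e. an AUC of 0.5.
separated = roc_auc_from_pvalues(np.linspace(0.5, 1.0, 100), np.linspace(0.0, 0.4, 100))
same = np.linspace(0.001, 0.999, 500)
diagonal = roc_auc_from_pvalues(same, same)
print(separated, diagonal)
```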
Testing performance evaluation
For each set of sample size, CV and α, we tested 10 000 pairs of randomly generated data samples to estimate the TPR and TNR, and then to calculate the informedness and AUC. We ran 1000 bootstrap samples to conduct the bootstrap t test. We also ran 1000 bootstrap samples on the testing outcomes of the two t tests to generate the 95% confidence intervals (CIs) of the estimated metrics.
For demonstration, we compare the testing outcomes by using COVID-19 serial interval (SI) data from Shenzhen and Hong Kong, China, where the SI is defined as the time interval between consecutive transmission generations. This demonstrative example is considered part of the results (rather than the methodology), and is thus elaborated in the next section.
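For readers who want a concrete picture of a single bootstrap t test with 1000 bootstrap samples, the Python sketch below follows the general shift-to-the-null scheme for testing equality of means: both samples are recentred to the pooled mean, resampled with replacement, and the observed t statistic is compared against the bootstrapped ones. It is an illustration under assumptions (unpooled variances, a two-sided P value, hypothetical function names) and may differ in detail from the procedure in [Reference Efron and Tibshirani9] as applied in this study.

```python
import numpy as np


def two_sample_t(x, y):
    """Two-sample t statistic with unpooled variances (last axis = observations)."""
    nx, ny = x.shape[-1], y.shape[-1]
    se = np.sqrt(x.var(axis=-1, ddof=1) / nx + y.var(axis=-1, ddof=1) / ny)
    return (x.mean(axis=-1) - y.mean(axis=-1)) / se


def bootstrap_t_test(x, y, n_boot=1000, seed=0):
    """Two-sided bootstrap P value for H0: the two population means are equal."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)

    t_obs = two_sample_t(x, y)

    # Shift both samples to the pooled mean so that H0 holds in the
    # populations being resampled.
    pooled_mean = np.concatenate([x, y]).mean()
    x_null = x - x.mean() + pooled_mean
    y_null = y - y.mean() + pooled_mean

    # Resample each group with replacement; one row per bootstrap replicate.
    xb = rng.choice(x_null, size=(n_boot, x.size), replace=True)
    yb = rng.choice(y_null, size=(n_boot, y.size), replace=True)
    t_boot = two_sample_t(xb, yb)

    # Share of bootstrapped statistics at least as extreme as the observed one.
    return float(np.mean(np.abs(t_boot) >= abs(t_obs)))


rng = np.random.default_rng(42)
p_value = bootstrap_t_test(rng.normal(0.0, 1.0, 15), rng.normal(1.0, 1.0, 15))
print(p_value)
```

A one-sided version would drop the absolute values and count only the bootstrapped statistics beyond the observed one in the direction of the alternative. Very small samples can produce bootstrap replicates with zero variance, which this sketch does not handle.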
Results and discussion
We found that the bootstrap t test outperformed Student's t test in terms of informedness across a wide range of sample sizes and CVs, see Figure 1(a) and (b). Since the TPRs were consistently stable at 95%, see Figure 1(c) and (d), the difference in informedness was due to the differences in the TNRs, see Figure 1(e) and (f). With fixed α, the bootstrap t test maintained equivalent accuracy in TPR, but significantly improved the TNR compared to Student's t test, see Figure 1(c)–(f). This can be interpreted as follows: compared to Student's t test, the bootstrap t test is more likely to reject the unrealistic hypothesis when H0 is false, while maintaining its judgement on the true statement when H0 is true. Since the null hypothesis is known a priori to be false [Reference Dushoff, Kain and Bolker13], H0 is commonly expected to be rejected based on sufficient (statistical) evidence [Reference Guyatt1, Reference Wilcox4, Reference Anderson, Burnham and Thompson10]. Thus, the improvement in TNR is highly desirable.
In Figure 2, the bootstrap t test outperformed, or performed equivalently to, Student's t test in terms of the AUC. It outperformed Student's t test not only when the sample size was small, e.g. see Figure 2(b) and (c), but also when the sample size was large, e.g. see Figure 2(i) and (k). Although Student's t test can be conducted with a sufficiently large sample size when the CLT is applicable [Reference Bland2, Reference Kirkwood and Sterne3], we found that the bootstrap t test outperformed, or performed equivalently to, Student's t test regardless of the sample size.
On the one hand, the AUC of Student's t test approached that of the bootstrap t test, i.e. equivalent performance, when the sample size became larger and the CV became smaller, e.g. see Figure 2(e) and (j). Under these circumstances, the distributions of the samples to be tested are clearly separated, and thus the two tests yield 'to reject H0' outcomes equivalently. This finding indicates that, given a sufficiently large sample size, Student's t test can achieve the same diagnostic ability as the bootstrap t test when the two datasets differ clearly in central tendency and have low dispersion. It is also interesting to note that the equivalent performance only appears when the AUCs of the two tests are equal to 0.5, i.e. a random classifier, or 1, i.e. a perfect classifier. Either AUC = 0.5 or AUC = 1 would rarely occur, because both require unusual features of the testing datasets, e.g. an extremely large sample size and small CV, or an extremely small sample size and large CV.
On the other hand, when the sample size is small and the CV is large, e.g. see Figure 2(b), (c), (d), (g) and (h), the distributions of the samples to be tested are difficult to differentiate. In these situations, the bootstrap t test outperformed Student's t test in terms of the AUC.
In summary, for data samples from normally distributed populations, the bootstrap t test outperformed Student's t test in both testing performance and diagnostic ability, regardless of the sample size and CV. We have summarised our findings, together with the situation when normality fails, in Table 1. In particular, for small samples that fail to meet the normality assumption, other non-parametric tests and their bootstrap versions are also recommended to fit the study purpose.
Note: The ‘dispersion’ in this study is measured by the CV.
Demonstrative example of COVID-19
We demonstrate the performance of the bootstrap t test against Student's t test by using the COVID-19 SI dataset from the early outbreaks in Shenzhen and Hong Kong, two neighbouring cities on the southeast coast of China. In infectious disease transmission, the SI is defined as the difference between the onset date of a secondary case and that of its associated primary case in a consecutive transmission chain [Reference Fine16]. With the pathogen's transmissibility fixed, a shorter SI implies that the disease may transmit more rapidly in terms of the epidemiological outcomes at the population scale, e.g. number of cases. The SI is one of the key epidemiological parameters that characterise the disease transmission process, and it is important in determining the changing patterns of the epidemic curve [Reference Wallinga and Lipsitch17–Reference Nishiura20]. The SI can be inferred from contact tracing surveillance data and reconstruction of the transmission chains, which has been well studied previously [Reference Xu21–Reference Cowling33], and it is widely adopted in modelling analysis [Reference Chinazzi34–Reference Zhao47].
The SI data were collected via the public domain until 22 February 2020 for Shenzhen, and until 15 February 2020 for Hong Kong. The study periods cover the major epidemic wave in Shenzhen and the first epidemic wave in Hong Kong. This dataset was published previously in [Reference Wang48, Reference Zhao49] and studied in [Reference Zhao50]. We extracted transmission pairs, i.e. one secondary case epidemiologically linked to one and only one primary case, with no missing information on the primary case's sex. We obtained a total of 34 transmission pairs, including 22 (14 male and 8 female primary cases) from Shenzhen and 12 (6 male and 6 female primary cases) from Hong Kong. In 33 (out of a total of 34) transmission pairs, the primary case's symptom onset date was in January 2020, see Figure 3.
We evaluate the two t tests by examining whether they are able to identify the difference in COVID-19 SI due to sex and non-pharmaceutical interventions (NPIs). Thus, we conduct the t test on two groups of SI samples separated from the original dataset based on two pieces of epidemiological evidence. They include
• evidence (i): according to previous studies [Reference Ma28, Reference Zhao50], a female COVID-19 primary case is likely to have a longer SI than a male primary case; and
• evidence (ii): due to NPIs, e.g. social distancing, city lockdown, travel suspension, wearing face masks and regular sterilisation, the SI was shortened, i.e. became smaller, over time [Reference Ali31, Reference Zhao50].
Hence, we divide the COVID-19 SI samples based on the sex of the primary case and on the Chinese Lunar New Year (CLNY) from 23 to 26 January 2020 [Reference Leung, Cowling and Wu51], after which most of the NPIs (including city lockdown) were implemented and enhanced. Two groups of SI samples are selected for the t tests. They are
• samples from population (i): SI samples with female primary cases whose symptom onset was before CLNY (sample size is 3, see red dots in Fig. 3), and
• samples from population (ii): SI samples with male primary cases whose symptom onset was after CLNY (sample size is 10, see blue dots in Fig. 3).
Straightforwardly, the mean SI of population (i) is expected to be higher than the mean SI of population (ii), which is also supported by the evidence found in previous studies [Reference Ma28, Reference Ali31, Reference Zhao50].
As for the outcomes of the t tests, the one-sided bootstrap t test yields a statistically significant P value = 0.04, whereas the one-sided Student's t test yields a P value = 0.05. Therefore, we demonstrate that the bootstrap t test outperforms Student's t test by successfully detecting the difference in COVID-19 SI due to sex and NPIs.
Limitations
This comparison study has limitations. As one of the classic drawbacks mentioned in [Reference Athreya52], the bootstrap is unlikely to converge for samples from a population without a finite variance. However, medical data samples (commonly) come from the real world, and thus their variances are expected to be finite. Although we have demonstrated the testing performance by using large sets of randomly generated data samples, the study would benefit from real-world examples in which the bootstrap t test and Student's t test lead to different conclusions.
Conclusions
We demonstrated that the bootstrap t test outperforms Student's t test, and it is recommended to replace Student's t test in medical data analysis regardless of sample size.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0950268821001047
Acknowledgement
The authors thank G. Yang and Y. Han, both from the Chinese University of Hong Kong, for their helpful discussion at the very early stage of this study.
Author contributions
SZ conceptualised the study, conducted the analysis, drafted the manuscript and critically revised the contents. SZ, ZY and DH discussed the results. All authors had full access to the data, contributed to the study, approved the final version for publication and take responsibility for its accuracy and integrity.
Financial support
DH was supported by General Research Fund (Grant Number 15205119) of the Research Grants Council (RGC) of Hong Kong, China, and an Alibaba (China) Co., Ltd. Collaborative Research grant.
Conflict of interest
DH received support from an Alibaba (China) Co., Ltd. Collaborative Research grant. MHW is a shareholder of Beth Bioinformatics Co., Ltd. Other authors have no conflict of interest.
Ethical standards
No experiment was conducted. This research was based on computer simulation and a publicly available dataset. Hence, ethics approval was not applicable.
Data availability statement
The COVID-19 data used in this study are available via the Supplementary materials.