Impact of selection bias on polygenic risk score estimates in healthcare settings

Younga Heather Lee; Tanayott Thaweethai; Yi-Han Sheu; Yen-Chen Anne Feng; Elizabeth W. Karlson; Tian Ge; Peter Kraft; Jordan W. Smoller

doi:10.1017/S0033291723001186

Published online by Cambridge University Press: 25 May 2023

Younga Heather Lee: Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA
Tanayott Thaweethai: Affiliation:
Harvard Medical School, Boston, Massachusetts, USA; Biostatistics Center, Massachusetts General Hospital, Boston, Massachusetts, USA
Yi-Han Sheu: Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA
Yen-Chen Anne Feng: Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Analytic and Translational Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Division of Biostatistics and Data Science, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
Elizabeth W. Karlson: Affiliation:
Harvard Medical School, Boston, Massachusetts, USA; Division of Rheumatology, Immunity, and Inflammation, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
Tian Ge: Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Center for Precision Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
Peter Kraft: Affiliation:
Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA
Jordan W. Smoller*: Affiliation:
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Center for Precision Psychiatry, Massachusetts General Hospital, Boston, Massachusetts, USA
*: Corresponding author: Jordan W. Smoller; Email: [email protected]

Abstract

Background

Hospital-based biobanks are increasingly considered a resource for translating polygenic risk scores (PRS) into clinical practice. However, because these biobanks draw on patient populations, polygenic risk estimates may be biased by the overrepresentation of patients who interact with the healthcare system more frequently.

Methods

PRS for schizophrenia, bipolar disorder, and depression were calculated using summary statistics from the largest available genomic studies for a sample of 24 153 European ancestry participants in the Mass General Brigham (MGB) Biobank. To correct for selection bias, we fitted logistic regression models with inverse probability (IP) weights, which were estimated using 1839 sociodemographic, clinical, and healthcare utilization features extracted from electronic health records of 1 546 440 non-Hispanic White patients eligible to participate in the Biobank study at their first visit to the MGB-affiliated hospitals.
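The weighting idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the simulated population, the single binary utilization feature `high_use`, the enrollment and case probabilities, and the function names. In particular, it estimates selection probabilities from stratum frequencies rather than from a model fitted on 1839 EHR features, so it shows the principle of IP weighting and not the authors' pipeline.

```python
import random
from collections import defaultdict


def simulate_population(n, seed=0):
    """Hypothetical eligible population (all probabilities are made up).

    Frequent healthcare users are both more likely to be cases and more
    likely to enroll, which is what induces selection bias.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        high_use = rng.random() < 0.3
        p_case = 0.20 if high_use else 0.05
        p_enroll = 0.10 if high_use else 0.02
        people.append({
            "high_use": high_use,
            "case": rng.random() < p_case,
            "enrolled": rng.random() < p_enroll,
        })
    return people


def estimate_ip_weights(people):
    """Return (person, weight) pairs for enrolled people.

    P(enrolled | stratum) is estimated as the observed enrollment fraction
    in each utilization stratum; the IP weight is its reciprocal.
    """
    totals = defaultdict(int)
    enrolled = defaultdict(int)
    for p in people:
        totals[p["high_use"]] += 1
        enrolled[p["high_use"]] += p["enrolled"]
    prob = {k: enrolled[k] / totals[k] for k in totals}
    return [(p, 1.0 / prob[p["high_use"]]) for p in people if p["enrolled"]]


def prevalence(pairs, weighted):
    """Case prevalence among enrolled people, optionally IP-weighted."""
    num = sum((w if weighted else 1.0) * p["case"] for p, w in pairs)
    den = sum((w if weighted else 1.0) for _, w in pairs)
    return num / den


population = simulate_population(200_000)
sample = estimate_ip_weights(population)

true_prev = sum(p["case"] for p in population) / len(population)
naive_prev = prevalence(sample, weighted=False)
ipw_prev = prevalence(sample, weighted=True)

print(f"eligible population: {true_prev:.3f}")
print(f"enrolled, unweighted: {naive_prev:.3f}")
print(f"enrolled, IP-weighted: {ipw_prev:.3f}")
```

In this toy setting the unweighted prevalence among enrollees overstates the population prevalence, while the weighted estimate moves back toward it. With many features, the stratum frequencies would typically be replaced by predicted probabilities from a fitted selection model.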

Results

Case prevalence of bipolar disorder among participants in the top decile of bipolar disorder PRS was 10.0% (95% CI 8.8–11.2%) in the unweighted analysis but only 6.2% (5.0–7.5%) when selection bias was accounted for using IP weights. Similarly, case prevalence of depression among those in the top decile of depression PRS was reduced from 33.5% (31.7–35.4%) to 28.9% (25.8–31.9%) after IP weighting.
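As a rough illustration of how a top-decile prevalence can differ between unweighted and weighted analyses, the sketch below computes both on a tiny invented dataset. The function name `top_decile_prevalence`, the toy scores, and the weights are hypothetical, and the decile is defined here on the unweighted ranking of scores, which is an assumption rather than something the abstract specifies.

```python
def top_decile_prevalence(prs, case, weights=None):
    """Case prevalence among participants in the top 10% of PRS.

    prs, case, and weights are parallel lists; weights default to 1,
    which gives the ordinary unweighted proportion.
    """
    if weights is None:
        weights = [1.0] * len(prs)
    # Rank participants from highest to lowest score and keep the top tenth.
    order = sorted(range(len(prs)), key=lambda i: prs[i], reverse=True)
    k = max(1, len(prs) // 10)
    top = order[:k]
    total_weight = sum(weights[i] for i in top)
    return sum(weights[i] * case[i] for i in top) / total_weight


# Toy data: 20 participants with distinct scores 1..20.
prs = list(range(1, 21))
case = [0] * 20
case[19] = 1            # the highest-scoring participant is a case
weights = [1.0] * 20
weights[18] = 3.0       # the second-highest is a non-case who was unlikely to enroll

unweighted = top_decile_prevalence(prs, case)
weighted = top_decile_prevalence(prs, case, weights)
print(unweighted, weighted)  # 0.5 0.25
```

The non-case in the top decile stands in for three similar people in the eligible population, so upweighting that participant lowers the estimated prevalence, which is the same direction of change reported above.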

Conclusions

Non-random selection of participants into volunteer biobanks may induce clinically relevant selection bias that could impact implementation of PRS in research and clinical settings. As efforts to integrate PRS in medical practice expand, recognition and mitigation of these biases should be considered and may need to be optimized in a context-specific manner.

Keywords

selection bias; polygenic risk score; biobank; inverse probability weighting; causal inference

Type: Original Article
Information: Psychological Medicine, Volume 53, Issue 15, November 2023, pp. 7435–7445

DOI: https://doi.org/10.1017/S0033291723001186
Copyright: © The Author(s), 2023. Published by Cambridge University Press


