Enhancing selection of alcohol consumption-associated genes by random forest

Chenglin Lyu; Roby Joehanes; Tianxiao Huan; Daniel Levy; Yi Li; Mengyao Wang; Xue Liu; Chunyu Liu; Jiantao Ma

doi:10.1017/S0007114524000795

Published online by Cambridge University Press: 12 April 2024

Chenglin Lyu: Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA; Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, USA
Roby Joehanes: Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Tianxiao Huan: Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Daniel Levy: Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Yi Li: Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Mengyao Wang: Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Xue Liu: Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Chunyu Liu*: Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Jiantao Ma*: Nutrition Epidemiology and Data Science, Friedman School of Nutrition Science and Policy, Tufts University, Boston, MA 02111, USA
*Corresponding authors: Chunyu Liu, email: liuc@bu.edu; Jiantao Ma, email: jiantao.ma@tufts.edu

Abstract

Machine learning methods have been used to identify omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve the identification of alcohol-associated transcriptomic markers. We analysed array-based, whole-blood-derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. Using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of all study participants), the AUC (area under the receiver operating characteristic curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. These AUC were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65 for the same three comparisons. With Bonferroni correction for the twenty-five Boruta-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three with type 2 diabetes and one with hypertension. For example, alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1; DOCK4 and SORT1 were positively associated with obesity, and IL4R was inversely associated with hypertension. In conclusion, using the RF-based Boruta algorithm, a supervised machine learning method, we identified novel alcohol-associated gene transcripts.
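As a rough illustration of two quantities reported in the abstract, the sketch below reproduces the Bonferroni threshold and shows one common way to compute an AUC. It assumes a family-wise significance level of 0·05, which the abstract does not state explicitly but which is consistent with the reported threshold for 25 transcripts and three risk factors. The function names and the toy scores are hypothetical and are not taken from the authors' analysis, which the reference list suggests was carried out in R.

```python
from itertools import product


def bonferroni_threshold(alpha: float, n_transcripts: int, n_traits: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / (n_transcripts * n_traits)


def pairwise_auc(case_scores, control_scores) -> float:
    """AUC as the share of (case, control) pairs ranked correctly.

    Ties between a case score and a control score count as half a win.
    """
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Assumed alpha = 0.05; 25 transcripts x 3 CVD risk factors = 75 tests.
threshold = bonferroni_threshold(0.05, 25, 3)
print(f"{threshold:.1e}")  # prints 6.7e-04

# Toy predicted scores (made up for illustration only).
toy_auc = pairwise_auc([0.9, 0.6, 0.4], [0.5, 0.3])
print(round(toy_auc, 3))
```

The pairwise definition is quadratic in the number of participants, so it is only suitable for small examples; rank-based implementations give the same value more efficiently.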

Keywords

Alcohol consumption; Gene expression; CVD; Machine learning; Random forest; Boruta

Type: Research Article
Information: British Journal of Nutrition, Volume 131, Issue 12, 28 June 2024, pp. 2058–2067

DOI: https://doi.org/10.1017/S0007114524000795
Copyright: © The Author(s), 2024. Published by Cambridge University Press on behalf of The Nutrition Society

Footnotes

† These authors contributed equally to this work.
