Introduction
Seasonal influenza has always been a major public health problem [Reference Barnett1, Reference Iuliano2]. It annually causes tens of millions of respiratory illnesses and hundreds of thousand deaths worldwide [Reference Ginsberg3]. An accurate forecast of influenza activity in advance based on predictive models is crucial for public health authorities to predict the seasonal fluctuation and facilitate key response actions [Reference Chretien4, Reference Axelsen5], such as public health surveillance, deployment of emergency supplies and hospital resource management. However, accurate prediction remains a great challenge. A number of statistical approaches have been employed and evaluated. Random forest (RF) regression model was suggested to have enhanced prediction ability over the autoregressive integrated moving average (ARIMA), the generalized linear autoregressive moving average time series model [Reference Petukhova6, Reference Kane7] in context of animal influenza activity prediction. It performed better in identifying independent factors associated with H1N1pdm influenza infections over boosted regression trees, conventional and penalised logistic regression [Reference Mansiaux and Carrat8].
Meteorology plays an important role in the varied seasaonal patterns of influenza in temperate, subtroptical and tropical regions. Influenza activity has been reported to peak during rainy seasons in tropical climates and during dry, cold months of winter in temperate climates. The impact of climate conditions on influenza A and B could be different [Reference Iha9].
Influenza like illness (ILI) has been commonly used as the index of influenza activity worldwide [Reference Ginsberg3–Reference Axelsen5], however, a number of respiratory pathogenicities, including parainfluenza, adenovirus and rhinovirus, could cause ILI and thus, influence ILI activity fluctuation [Reference Huo10]. Recently, the positive rate of influenza virus in surveillance samples has been considered a more reliable indicator of influenza activity [Reference Shaman11,Reference Nishiura12].
Jiangsu Province is situated in the middle east coast of China and is a transitional district of warm temperate zone to subtropical zone. Researches conducted in this region could deliver a more comprehensive understanding of climate impact on influenza activity. In this study, we aim to develop RF models to predict the ILI activity, positive rates of Flu-A and Flu-B, respectively, which has been rare in published studies.
Materials and methods
Data sources
Surveillance of ILI and influenza virus in China is conducted through a national sentinel network [Reference Yu13], with sentinel sites covering 2.5% of all hospitals across the country. Data for patients fitting the definition of ILI (i.e. body temperature ⩾38 °C with a cough and/or a sore throat) is reported to the China Influenza Surveillance Information System (CISIS) on a weekly basis. In Jiangsu, for each sentinel site, no less than 20 nasopharyngeal swabs are collected in a week by convenience samples of ILI cases before antiviral therapy. These specimens are routinely tested for influenza virus subtypes using real-time fluorescent quantitative polymerase chain reaction (PCR) assay and the results are reported to CISIS within 48 h.
In this study, weekly data of ILI percentage in outpatients (ILI%) and influenza virus positive rate in Jiangsu during 2011–2016 were obtained from CISIS. The daily meteorological data were downloaded from China Meteorological Data Sharing Service System (http://cdc.cma.gov.cn) and aggregated into weekly data. These meteorological viriables include precipitation (PR), sunshine duration (SD), relative humidity (RH), atmospheric pressure (AP), minimum temperature (MIN_T), mean temperature (MEAN_T) and maximum temperature (MAX_T).
Random forest
Rf is an ensemble machine learning method proposed by Breiman [Reference Breiman14]. RF creates multiple classification and regression trees, each trained on a bootstrap sample of the original training data with a randomly selected subset of input variables. There are two parameters to choose when running a RF algorithm: the number of trees and the number of randomly selected variables. In regression, the tree predictor takes on numerical values as opposed to class labels used by the RF classifier. RF regression models take the average of outputs produced by the trees as the final prediction.
One of the most important features of RF is to calculate the variable's importance, which measures the association between a given variable and the prediction accuracy. RF regression approach discussed in this study uses the decrease in accuracy to assess the variable's importance. As suggested by previous studies about the good prediction capacity, we explored RF method in human seasonal influenza activity analyses, testing its forecasting and independent influence factors identifying performance.
Model evaluation
Data of 2011–2015 were split as a training set to fit the RF models, reserving 2016 as testing set to evaluate the predicting accuracy. Coefficient of determination (R2) and mean absolute percentage error (MAPE) were employed to evaluate the models' performance both in the model fitting stage and prospective forecasting stage. They were calculated as follows:
where y i means the i th observation, $\hat y_i$ means the i th predication, $\bar y$ means average of observations and n is the number of observations.
Statistics analysis
Descriptive statistics was used to illustrate the temporal pattern of ILI% and the influenza virus positive rate. Time series analysis methods were employed to identify the autoregressive order [15] of the dependent variables (i.e. ILI%, positive rate of Flu A and positive rate of Flu B). Cross correlation is a measure of association of a time series with another time series at different lags [Reference Wang, Wang and Chen16], which is essentially a univariate correlation method. In this study, cross correlation was used to determine the lag of climate variable that was most significantly associated with dependent variables. All the analyses in this study were completed using R version 3.5.0. Particularly, cross-correlation analyses were completed using the R package ‘TSA’. RF model fitting and forecasting were done in the R package‘randomForest’ [Reference Liaw and Wiener17].
Results
General description
More than 2 million ILI cases were reported to CISIS from the sentinel sites in Jiangsu province during the study period, with an average weekly ILI of 3.92%. Totally 146 236 throat swabs were sampled from the ILI cases. Influenza viruses were detected in 16 197 swabs through real time RT-PCR, reaching a general positive rate of 11.08%. According to the typing results, Flu-A and Flu-B accounted for 64.27% and 35.73% of all influenza positive samples, reaching an average positive rate of 7.12% and 3.96%, respectively. Two peaks were observed in the ILI activity and the positive rate of Flu-A in each year, one occurred in winter and the other in summer. While the positive rate of Flu-B just showed a winter peak in a year (Fig. 1). The features of the meteorological variables were summarised in Table 1.
Correlation analysis
As shown in Table 2, AP and PR were significantly correlated with ILI% at lag 0–4. Mean_T and Min_T were also correlated with ILI% but with no lag effect. Max_T, RH and SD presented no relationship with ILI%. As to Flu-A, AP showed correlations at lag 0–3. The three temperature variables presented correlations at lag 0–4. All the meteorological factors were identified significant correlations with Flu-B at lag 0–4. The results of autocorrelation analysis were displayed in Fig. 2. ILI% presented autocorrelation at lag 1, while both Flu-A and Flu-B at lag 3.
*statistically significant at 0.05.
RF model fitting and forecasting
Three RF models with optimum parameters were finally constructed to predict ILI activity, Flu-A and Flu-B positive rates in Jiangsu province, including 13, 23 and 39 predictors, respectively. The dependent variable of Flu-A had undergone a natural logarithmic transformation before the model fitting. See Table 3.
The performance of the models is summarised in Table 4 and the predicting results are displayed in Fig. 3. The models for Flu-B and ILI% presented excellent performance both in model fitting stage and prospective forecasting stage, with MAPEs less than 10%. The model for Flu-A presented much worse than the other two, with MAPE up to19.49% in the test set. Nevertheless, the predicted values matched the real trend very well.
Variable importance
In each model, the lagged dependent variable was the most important of all predictors. The time variable presented as important in the models for ILI and Flu-A. Most of the meteorological factors and their lagged terms had the potential to improve the accuracy of the models to a certain degree, but their effects differed across the three models. For ILI forecasting, the weekly MEAN_T, AP and one order lagged AP were more important than the rest. For Flu-A, the lagged temperature specific variables were relatively important. With regard to Flu-B, the lagged AP and MAX_T presented greater effects than the other meteorological variables to improve the model accuracy. See Fig. 4.
Discussion
Forecasting of influenza activity in human populations is crucial for influenza prevention and control [Reference Chretien4]. Many methods have been introduced for this purpose. As a conventional univariate model, ARIMA technique has been commonly used to forecast seasonal influenza surveillance at national, regional and local levels [Reference Paul18–Reference Wang20]. ARIMA model is virtually a linear method. It can achieve good predication when the variation contained in the data is relatively stable. In practice, however, the long-term trend and seasonality of influenza activity change over time, so that the ARIMA model cannot always reach a satisfactory result.
Substantial studies have proposed that influenza activity is climate-sensitive [Reference Tamerius21–Reference Shaman23]. Climatic factors may influence the survival and spread of influenza viruses in the environment, the host susceptibility and exposure probability [Reference Hemmes, Winkler and Kool24–Reference Lipsitch and Viboud26]. The effects of meteorological factors on epidemics of ILI have attracted considerable interest recently. Sudarat Chadsuthi, et al. [Reference Chadsuthi27] fitted ARIMA model with temperature and RH as covariates to forecast the incidence of influenza in Thailand. N'gattia1, et al. [Reference N'Gattia28] also developed ARIMA with meteorological variable rainfall to forecast influenza transmission. But the prediction accuracy of these models was not good enough and the climate variables did not clearly optimise the models. In this study, we employed RF algorithm fitting models to predict influenza activity with meteorological factors in Jiangsu province, China. In contrast with previous studies, we constructed predicting models not only for ILI but also for the positive rates of influenza virus (i.e. flu-A and flu-B). All the models performed very well in our dataset. Based on them, we can comprehensively and systematically evaluate the influenza activity in the future, which has significant and practical meaning for influenza prevention and control. Given the good performance of RF in influenza prediction, the models we established could be used for influenza (sub)type-specific early warning and to evoke early intervention. The key meteorological factors identified could be used for publicity, to elevate the general population's consciousness and engagement in influenza prevention.
Similar to many other members of the machine learning family such as artificial neural networks, RF model cannot explain the association between risk factors and influenza activity. But RF can assess the importance of each variable on the accuracy of prediction [Reference Breiman14, Reference Ong29], which is essential to optimise the model and may provide clues for the further study of influenza risk factors. In this study, we found that the lagged dependent variables (i.e. the proportion of ILI in the outpatients and positive rates of flu-A and flu-B) in the previous weeks were more important than meteorological factors in the models. It suggests that these models took advantage of the autocorrelation of the dependent variables. The influenza activity in Jiangsu province presented obvious seasonality which is a critical feature to fit predicting model. However, RF is unable to learn the seasonal patterns because of randomly selecting samples for each tree. In this study, we introduced a time variable into the models to fit the seasonal variance of ILI and positive rates of influenza viruses. The importance analysis shows that it played a significant role to improve the models. This strategy is worthy of reference when fitting the similar RF models. Compared with other multivariate predicting methods [Reference Chadsuthi27, Reference N'Gattia28], RF is not subject to multicollinearity, mainly due to randomly selecting variables for each tree in RF [Reference Ong29, Reference Carvajal30]. In this study, we selected predictor variables through cross-correlation analysis. The meteorological factors and their lagged terms were incorporated into the models so long as they were identified to be significantly correlated with the dependent variables. All of them presented some degree of importance, which suggested that the RF models comprehensively combined the climatic variables and their hysteresis effects. Furthermore, the importance of the meteorological factors differed across the three models, which may suggest that the influence of meteorological factors differs between ILI, flu-A and flu-B. The causes of this difference and its practical significance for influenza surveillance deserve further studies.
In this study, humidity and PR were not recognised as major meteorological factors related to ILI activity, positive rate of flu A and B, while the temperature was identified as the main driver. This is consistent with our previous study [Reference Dai31]. The present study also indicates that AP plays an important role in the activity of ILI and flu B. An increased influenza risk associated with rising AP was also reported in another subtropical region in China, using distributed lag nonlinear model [Reference Guo32].
Our study suggests that the selected meteorological variables contributed less to the fluctuations of ILI, flu A and B, compared with the effect of autocorrelation, which has been shown as the most important of independent variables. Monamele GC, et al. also supposed that meteorological parameters could only explain no more than 30% of the influenza activity variation [Reference Monamele33]. Although our constructed RF models showed desirable predictive ability, especially for ILI and flu B, more meteorological factors, such as specific humidity and absolute humidity, and population-specific immunity level [Reference Mansiaux and Carrat8] are warranted to be evaluated to improve the prediction of type/subtype-specific activity [Reference Pan34].
Conclusion
RF model is a good method to predict the influenza activity. Three RF models were constructed to predict the positive rate of influenza viruses and ILI incidence and performed very well. The autocorrelation and seasonal variation contained in the data of the dependent variables are crucial for the prediction models. Meanwhile, the effects of meteorological factors and cumulative effects over a period of time were combined to improve the models. Further researches are warranted to explore RF model with meteorological factors as well as other variables and it has the potential to be a useful tool for predicting other major infectious diseases.
Acknowledgements
This work was supported by Jiangsu Provincial Major Science & Technology Demonstration Project (No. BE2017749), Jiangsu Provincial natural science foundation (No. BK20151595), Jiangsu Provincial Medical Youth Talent (No. QNRC2016542, QNRC2016539), Preventive medicine research program (Y2018074) and Key Medical Discipline of Epidemiology (No. ZDXK A2016008).
Author contributions
WL, HX conceived and designed the experiments. WL, HX, JH, CB performed the experiments. WL, JB, KX, QD analysed the data. WL, YW, YS, WS contributed materials/analysis tools. WL, XH wrote the paper. All authors read and approved the final manuscript.
Ethical standards
According to the National Health Commission of China, infectious diseases surveillance was exempt from institutional review board assessment. The dataset was anonymised in the national reporting system (CISIS) except for individuals with special access and was anonymised before data analyses.