INTRODUCTION
Campylobacter sp. bacteria are motile, spiral-shaped, Gram-negative organisms found ubiquitously in the environment [Reference Snelling1, Reference Humphrey, O'Brien and Madsen2]. They have been identified as a leading cause of human gastroenteritis in developed nations, surpassing pathogens such as Salmonella sp. and E. coli [Reference Allos, Taylor, Evans and Brachman3, Reference Altekruse, Swerdlow, Cliver and Riemann4]. An estimated 1% of the US population (2 400 000 persons) is infected annually, resulting in 13 000 hospitalizations and 124 deaths [Reference Tauxe5]. Campylobacter sp. can be found in the gastrointestinal tracts of a wide variety of domestic and wild animals and birds [Reference Frost6–Reference Newell and Wassenaar8]. As a result, establishing causative associations between human infection and contaminated food or water, animal contact and other environmental sources is a formidable task. Geographic region, climate patterns, drinking and recreational water, land use and human behaviour are among the complex set of determinants shown to affect the rate of gastrointestinal disease [Reference Jepsen, Simonsen and Ethelberg9–Reference Zhang, Bi and Hiller15]. The incidence of campylobacteriosis varies seasonally and geographically and tends to be highest in the summer months, particularly in temperate climate zones [Reference Allos, Taylor, Evans and Brachman3, Reference Bi14, Reference Nylen16, Reference Tam17]. While the seasonality of the disease has been well documented worldwide, extensive studies have not been performed to predict the future risk of disease in different geographic regions of the USA. Comparing seasonal patterns in regions with different environmental characteristics may help identify transmission routes, making reliable time-series forecasting of great benefit to epidemiologists and public health officials [Reference Hartnack10, Reference Nobre18–Reference Altizer20].
A variety of modelling approaches have been applied to surveillance data over the past 20 years in an attempt to accurately predict patterns of infectious diseases [Reference Bi14, Reference Zhang, Bi and Hiller15, Reference Nobre18, Reference De Greeff19, Reference Tokars21–Reference Naumova29]. Statistical time-series modelling is appropriate because Campylobacter sp. disease surveillance data can be aggregated into equally spaced time intervals and exhibit autocorrelation, trend and seasonality [Reference Williamson and Weatherby Hudson27, Reference Naumova and MacNeil30]. The potential for emerging infectious disease patterns to change in response to anthropogenic climate and land-use changes warrants the continual improvement and updating of current forecasting systems. Technological advances in forecasting software and program capability call for systematic review of available methods and their applicability in the realm of public health. Recent efforts at automated, real-time detection have met with varying levels of success [Reference Rolfhamre and Ekdahl28]. Our study applies a univariate methodological approach to forecast monthly disease risk using campylobacteriosis incidence from three US states.
Finding the most accurate time-series disease risk model at the state level has numerous practical implications. Systematic analysis of multiple modelling techniques aims to provide public health officials with a state-specific, accurate and user-friendly method for predicting disease risk. The best model could potentially be implemented by trained public health professionals. Risk forecasting could provide public health officials with an early indication of irregularity in disease incidence and act as an epidemic alert system [Reference Nobre18, Reference Medina24, Reference Williamson and Weatherby Hudson27, Reference Cardinal, Roy and Lambert31, Reference Myers32]. Model application could subsequently result in more efficient and cost-effective control strategies [Reference Benschop33].
The purpose of this study is to evaluate three time-series models using data from three US states, Georgia, Oregon and Minnesota, to forecast the monthly risk of campylobacteriosis one year in advance. We also aim to determine if current software is capable of accurately simplifying time-series methods for practical use in the public health arena.
METHODS
Data source and study area description
The data utilized for this project were obtained from FoodNet, an active surveillance system implemented in 1996 by the Centers for Disease Control and Prevention (CDC) [34]. To meet the operational case definition of campylobacteriosis, samples of either stool or blood must be laboratory-confirmed as positive for Campylobacter sp.
Data from Georgia, Oregon and Minnesota were chosen for completeness and climatic diversity. Both direct and indirect disease transmission may be affected by weather conditions; it is therefore important to predict disease risk for geographically diverse regions [Reference Bi14]. Oregon experiences temperate climatic conditions characterized by 9 months of consistent cloud cover and rain [35]. Annual precipitation varies regionally from 50 to 500 cm. During the summer months, July–September, there are about 50 days of clear sky with average daily temperatures between 30 and 38°C. Georgia is characterized by a humid subtropical climate and receives about 114 cm of annual rain in the middle of the state and 180 cm in the northeast mountains [36]. Summers are hot and humid with an average daily temperature of 32°C. Minnesota's climate is the most extreme, with average daily temperatures ranging between −14 and −11°C in January, and between 19 and 23°C in July [37]. Average annual precipitation is 48 cm in Minnesota's northwest region and 86 cm in the southeast. We hypothesize that climatic differences between states may affect the characteristics of the campylobacteriosis risk curve over the course of the year. In turn, this may influence statistical forecasting methods, as well as prevention and control strategies.
Data preparation
FoodNet surveillance data were aggregated into counts by month for each state over the study period, resulting in 108 data points in Georgia and 120 data points each in Oregon and Minnesota, equally spaced over time. The series lengths are statistically appropriate for the three time-series methods [Reference DeLurgio38]. To ensure the regional integrity of the risk estimates, cases identified as travel related were eliminated from the dataset. The years 1998 (1999 for Georgia) to 2007 were used to model each time series, and the year 2008 was held out of the dataset for model validation. Data manipulation was performed in SAS version 9.2 [39]. Risks were calculated using annual population estimates, obtained from the U.S. Census Bureau, as denominators [40]. The risk estimates are presented as number of cases/100 000 persons. The statistical analyses were performed in NCSS-2007 [Reference Hintze41]. The forecasting methods used were time-series regression, decomposition, and Box–Jenkins autoregressive integrated moving average (ARIMA) models. Fit statistics and holdout R² values were calculated manually. Separate model forecasts were assessed for each state.
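For illustration only, the aggregation and risk calculation can be sketched in a few lines of Python with pandas; the actual processing was done in SAS, and the column names (onset_date, travel_related) and population series used below are assumptions, not fields of the FoodNet extract.

```python
import pandas as pd

def monthly_risk(cases: pd.DataFrame, population: pd.Series) -> pd.Series:
    """Aggregate laboratory-confirmed cases into monthly risk per 100 000 persons.

    `cases` is a hypothetical line list with a datetime column 'onset_date' and a
    boolean 'travel_related'; `population` holds annual estimates indexed by year.
    """
    cases = cases.loc[~cases["travel_related"]]                      # drop travel-related cases
    counts = cases.set_index("onset_date").resample("MS").size()     # monthly case counts
    denom = population.reindex(counts.index.year).to_numpy()         # annual denominators
    return counts / denom * 100_000

# Hold the final year out for validation, e.g.:
# risk = monthly_risk(georgia_cases, georgia_population)
# train, holdout = risk[:"2007-12"], risk["2008-01":"2008-12"]
```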
Pattern analysis and outlier identification
Pattern analysis was performed on monthly risk data using autocorrelation (ACF) and partial autocorrelation (PACF) plots. Kruskal–Wallis ANOVA was performed on monthly medians to verify seasonality (P<0·05). The simplistic strategy of identifying outliers as data points more than 3 s.d. from the mean is invalid for time-series data because it ignores the autoregressive or moving-average patterns in the data. Instead, time-series outliers were identified by fitting a basic ARIMA model to the data series. The resulting residuals were saved and standardized by the root mean square error of the ARIMA (1,0,0)(0,1,1) model. These standardized residuals were then control charted, and observations more than 3 s.d. from the mean of zero were flagged as outliers in the time-series data. This outlier identification procedure avoids over-identification of outliers in time series [Reference Alwan42].
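Purely as an illustration of this screening step (the original analysis was run in SAS/NCSS, and the variable names here are assumed), a minimal sketch using statsmodels might look as follows:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def flag_outliers(risk, threshold=3.0):
    """Flag months whose standardized ARIMA(1,0,0)(0,1,1)12 residual exceeds |threshold|."""
    fit = SARIMAX(risk, order=(1, 0, 0), seasonal_order=(0, 1, 1, 12)).fit(disp=False)
    rmse = np.sqrt(np.mean(fit.resid ** 2))
    z = fit.resid / rmse                    # standardized residuals (control-chart values)
    return risk.index[np.abs(z) > threshold]

# e.g. flag_outliers(oregon_train) would be expected to flag June 1998.
```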
Outbreak information on individual cases is incomplete in this dataset. All cases were aggregated by month regardless of outbreak status. Outliers identified by control charting were individually checked for potential outbreak status. No association between outlier months and reported outbreak cases was found.
Time-series modelling techniques
The models were quantitatively evaluated on the basis of their predictive ability using mean square error (MSE), mean absolute percentage error (MAPE), R² on the training data, and a holdout R² based on the 2008 data for all three modelling techniques. Outside of time-series analysis, R² is most often associated with multiple regression. However, a pseudo-R² can be computed for any time-series model as follows:

$$R^2 = 1 - \frac{\sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}, \qquad (1)$$

where $Y_t$ is the observed risk, $\hat{Y}_t$ the model prediction and $\bar{Y}$ the mean of the observed series.
This pseudo-R² is simply 1 minus the sum of squared residuals divided by the total sum of squares. For the holdout R², the calculation is the same as in equation (1), except that the model estimated from the training sample is applied to the holdout, the summation runs over the holdout observations only, and $\bar{Y}$ is the mean for the holdout time period. In essence, all models can be evaluated in the same way. Henceforth, all further comparisons are addressed simply as R².
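For completeness, the calculation is sketched below (the variable names are illustrative; the published values were computed manually):

```python
import numpy as np

def pseudo_r2(observed, predicted):
    """1 minus SSE/SST, applicable to any time-series model's predictions."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    sse = np.sum((observed - predicted) ** 2)
    sst = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - sse / sst

# Training R2 uses the fitted values over 1998-2007; the holdout R2 applies the
# same formula to the 12 forecasts for 2008, with the mean taken over 2008 only:
# r2_train   = pseudo_r2(train, fitted_values)
# r2_holdout = pseudo_r2(holdout_2008, forecasts_2008)
```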
Time-series regression
Ordinary least squares multiple regression models were evaluated using additive (untransformed) and multiplicative (logarithm-transformed) risks. Predictors included trend, month, year and trend×month interactions. Variables were retained if they improved predictive value (R²), produced globally significant (P<0·05) models with significant regression coefficients, and lacked collinearity (variance inflation factor <5). The basic time-series regression model used was additive, as shown in equation (2):

$$Y_t = \beta_0 + \beta_1 t + \sum_{m=1}^{11} \beta_{m+1} M_{m,t} + \varepsilon_t, \qquad (2)$$

where $Y_t$ is the monthly risk, $t$ the linear trend, $M_{m,t}$ are indicator variables for month and $\varepsilon_t$ is the error term.
This model assumes linear trend and seasonality but no interaction between the two.
Residual time-series plots were examined for all models and checked for normality using the Shapiro–Wilk goodness-of-fit test. In addition, the residuals of a well-fitted time-series model should be white noise (i.e. contain no remaining pattern). Therefore, the Portmanteau test was used to assess white noise, with degrees of freedom adjusted according to the number of predictor variables [Reference DeLurgio38]. This test verifies that the pattern has been fully extracted from the series and that the residuals are randomly scattered.
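As an illustration of these diagnostics (the original checks were performed in NCSS; the helper below and its arguments are assumptions):

```python
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import acorr_ljungbox

def residual_checks(residuals, n_params, lags=12):
    """Shapiro-Wilk test for normality and Ljung-Box (Portmanteau) test for white noise."""
    sw_stat, sw_p = shapiro(residuals)                                  # H0: residuals are normal
    lb = acorr_ljungbox(residuals, lags=[lags], model_df=n_params)      # df reduced by fitted parameters
    return {"shapiro_p": float(sw_p), "ljung_box_p": float(lb["lb_pvalue"].iloc[0])}

# P-values above 0.05 are consistent with normally distributed white-noise residuals.
```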
Automatic decomposition
A decomposition macro available in NCSS and other software was applied [Reference Hintze41]. The series was decomposed into trend, seasonal, cyclic and error components. The decomposition model that worked best on these data was multiplicative, as shown in equation (3):

$$Y_t = T_t \times S_t \times C_t \times \varepsilon_t, \qquad (3)$$

where $T_t$, $S_t$, $C_t$ and $\varepsilon_t$ denote the trend, seasonal, cyclic and error components, respectively.
Residual analysis for white noise and normality was performed as described for time-series regression.
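For readers without NCSS, a rough open-source stand-in is sketched below. It is only an approximation of the macro: classical multiplicative decomposition with a straight-line extrapolation of the trend, omitting the cyclic component and prediction intervals, and assuming the series starts in January with a length that is a multiple of 12.

```python
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_forecast(train, horizon=12):
    """Forecast `horizon` months ahead as (extrapolated trend) x (seasonal factor)."""
    parts = seasonal_decompose(train, model="multiplicative", period=12)
    seasonal = parts.seasonal.to_numpy()[:12]                    # one cycle of monthly factors
    pos = np.arange(len(train))[parts.trend.notna().to_numpy()]  # positions with a trend estimate
    slope, intercept = np.polyfit(pos, parts.trend.dropna().to_numpy(), 1)
    future = np.arange(len(train), len(train) + horizon)
    factors = np.tile(seasonal, horizon // 12 + 1)[:horizon]
    return (intercept + slope * future) * factors
```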
Box–Jenkins ARIMA
The ARIMA modelling was based on the techniques described by Box & Jenkins in 1976 and further explained by DeLurgio [Reference DeLurgio38, Reference Box and Jenkins43]. The ACF and PACF plots were used to identify starting orders. Exhaustive combinations of autoregressive (AR), moving average (MA) and differencing parameters were fitted up to the third order. Orders above three were not attempted because of the high likelihood of model overspecification. First-order seasonal differencing resulted in the best models for all three states and compensated for non-stationarity in the mean [Reference Box and Jenkins43]. The best models were selected after various fit statistics were evaluated. The best ARIMA model was ARIMA (1,0,0)(0,1,1), which is expressed using backshift operators in equation (4):

$$(1 - \phi_1 B)(1 - B^{12})\,Y_t = (1 - \Theta_1 B^{12})\,\varepsilon_t, \qquad (4)$$

where $B$ is the backshift operator, $\phi_1$ the non-seasonal AR(1) coefficient, $\Theta_1$ the seasonal MA(1) coefficient and $\varepsilon_t$ the error term.
Significant (P<0·05) coefficients were retained, with correlations <0·8 between parameter estimates. Residual analysis for normality and white noise was performed as described for time-series regression.
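A simplified sketch of such an order search is given below. It fixes the seasonal part at (0,1,1)12 and ranks candidates by AIC, which is a simplification; the study evaluated several fit statistics in NCSS before retaining ARIMA (1,0,0)(0,1,1).

```python
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

def search_arima(train, max_order=3):
    """Fit all (p,d,q) up to `max_order` with seasonal (0,1,1,12); return the lowest-AIC fit."""
    best = None
    for p, d, q in itertools.product(range(max_order + 1), repeat=3):
        try:
            fit = SARIMAX(train, order=(p, d, q),
                          seasonal_order=(0, 1, 1, 12)).fit(disp=False)
        except Exception:
            continue                                    # skip models that fail to converge
        if best is None or fit.aic < best[0]:
            best = (fit.aic, (p, d, q), fit)
    return best

# An automated AIC search will not necessarily reproduce the published choice; candidate
# models should also be screened for residual normality and white noise as described above.
```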
RESULTS
Pattern analysis
Monthly risks ranged from 0·236 to 1·191/100 000 persons (mean 0·593) in Georgia, 0·635 to 2·895 (mean 1·443) in Oregon and 0·333 to 4·655 (mean 1·435) in Minnesota. All three series demonstrate seasonality (Fig. 1a–c). The vacillating seasonal pattern in the ACF plots dominates and potentially masks AR and MA components. The ACF and PACF plots for Georgia are shown in Figure 2(a, b). The exponential decay of the seasonal pattern in the ACF, together with the single first-order spike in the PACF, is indicative of an AR(1) process. Further examination of regular and seasonal differencing suggested a possible MA(1) term for the seasonal component. The patterns were not clean, implying other model possibilities, potential outliers, or both.
Outlier identification
Outliers were not identified in Georgia using the ARIMA control process techniques; therefore, no further pre-processing or smoothing methods were applied to the Georgia time series. For the Oregon series, June 1998 was flagged as an outlier in both the raw and residual control chart analyses. The mean risk in June was 2·11/100 000 persons. A running median of five consecutive June values was chosen for the June observations to preserve the seasonal effect. The original outlier value of 2·864/100 000 persons was replaced with 2·017. The models performed consistently worse with smoothed data. As a result, all Oregon forecasting was applied to the original unsmoothed data.
Control charting of the Minnesota series indicated that June 1998 was out of range for both control charting techniques; the data point lay more than 3 s.d. from centre. To correct the outlier, a running median of five consecutive June values was chosen for smoothing. The replacement median risk value of 2·372 (original value 4·65, June mean 2·420) was used in all further analyses.
Time-series model results and comparisons
The results for the best models identified for each technique are summarized in Table 1. All ARIMA and regression models were significant (P<0·05) both globally and for individual model coefficients. Decomposition models do not rely on overall model significance testing to assess fit.
MSE, mean square error; WN, white noise; MAPE, mean absolute percentage error.
* PRED is a prediction R² value (sometimes referred to as a PRESS R²). This statistic is used to internally validate the regression model using jackknife techniques.
† R² of the 2008 validation sample.
Regression
The best regression model for all three states was additive and contained statistically significant (P<0·05) trend and monthly estimates. The R² value in Georgia was 69·3%, in Oregon 71·0% and in Minnesota 83·5%. In all three states, normality of the residuals was achieved but not white noise.
Automatic decomposition
The decomposition risk predictions for campylobacteriosis resulted in the highest fit statistics of the three methods. The R² value for Georgia was 72·7%, for Oregon 76·3% and for Minnesota 89·9%. Normality in the residuals was achieved for all three series. None of these models attained perfect white noise. The Georgia model was adequate for white noise only at lags 1 and 2. The Oregon model was inadequate overall. The Minnesota residuals were adequate for white noise at lags 1, 4, and 7–11.
The actual and predicted risk values for the decomposition validation year (2008) are shown in Figure 3(a–c). The validation dataset R² values for Georgia, Oregon and Minnesota were 66·2%, 52·6% and 79·9%, respectively.
ARIMA
For all three states, the most parsimonious model with the highest R² value (Georgia 65·5%, Oregon 72·4%, Minnesota 84·1%) was ARIMA (1,0,0)(0,1,1). Better R² values were possible using AR and MA components higher than two. However, complex models tend to over-fit, demonstrate multicollinearity and frequently fail to meet the assumption of normality or achieve white noise. The ARIMA models had markedly lower R² values than the decomposition models, but higher holdout R² values, except in Minnesota.
The observed seasonal variation in campylobacteriosis identifies June, July and August as the highest-risk months for disease in all three states. However, the overall shape of the curve differs across series (Fig. 4). Minnesota's annual curve has the sharpest, narrowest seasonal peak until 2004, after which its shape coincides closely with Oregon's. The seasonal peak in Georgia is more rounded and less severe.
DISCUSSION
The automatic decomposition procedure resulted in a 4–7% improvement in R² over the best regression and ARIMA models. Of the three methods, decomposition was the fastest and least technical and achieved normality in the residuals, yet it was uniformly unsuccessful at achieving white noise. Lack of white noise implies that there is a pattern in the residuals not accounted for by the model. Residual patterns may increase model uncertainty. However, good predictive performance can be achieved without perfect attainment of white noise [Reference Burkom, Murphy and Shmueli22].
The use of ARIMA modelling for disease risk data is well documented [Reference Nobre18, Reference Williamson and Weatherby Hudson27, Reference Cardinal, Roy and Lambert31, Reference Benschop33, Reference Reichert44]. It was originally expected that ARIMA methods would be favoured based on their previously published use with surveillance data, their versatility and the availability of prediction intervals [Reference Nobre18]. These data show that ARIMA models were closer to achieving white noise in the residuals and gave better holdout-sample fit statistics in Georgia and Oregon. Compared with automatic decomposition, however, this method is technically challenging, requiring a significant statistical background for appropriate and accurate implementation. Regression had the poorest model R² results. Advantages of regression include ease of interpretation, straightforward computation of prediction intervals, and the availability of robust and bootstrap approaches. Therefore, this technique should not be ruled out for risk forecasting of campylobacteriosis. For all three methods, MSE and MAPE were comparable and indicated accurate forecasting. However, fit statistics for decomposition were uniformly better than those of the other two methods across states. Furthermore, constant 95% prediction intervals can be manually calculated for decomposition to demonstrate a range of predicted risk values.
A specific strength of automatic decomposition is that it produced accurate monthly campylobacteriosis risk predictions for all three states. A unique characteristic of this technique is that it can be taught to public health officials with minimal statistical background. By combining accuracy with ease of use, improvements in epidemic preparation and timely intervention are attainable at state, regional and national levels [Reference Myers32].
The distinct seasonal pattern of campylobacteriosis may suggest climatic or environmental links to the risk of disease [Reference Miller13, Reference Nylen16, Reference Tam17]. Climate affects the survival and reproduction of Campylobacter sp. in the environment and on food sources, and previous studies have shown that climatic factors influence disease incidence over time [Reference Bi14, Reference Fleury25, Reference Patrick45, Reference Bi46]. Hartnack et al. found significant cross-correlations between human incidence, monthly temperature and rainfall [Reference Hartnack10]. That study showed that peak prevalence in human campylobacteriosis preceded that in German broiler flocks, further implicating environmental rather than foodborne components of disease risk. Seasonal risk variation may also be due to human behavioural factors such as picnics, barbeques and other outdoor activities [Reference Jepsen, Simonsen and Ethelberg9, Reference Naumova and MacNeil30]. Such behaviours may vary depending on the climatic and socioeconomic constraints of a geographic region. The timing of seasonal peaks in our study was comparable across states. This contrasts with a recent study in Scotland which showed that the prominence of the seasonal peak in incidence varied regionally [Reference Miller13]. However, both studies demonstrated differences in the shape of the seasonal curve by region or state. Future studies are needed to elucidate the impact of these factors on disease risk by dividing states into unique climatic zones for time-series analysis using environmental variables specific to different geographic regions.
Considerable variation was observed in validation-data R² results across models and states. This may reflect the models' predictive accuracy, shifts in disease patterns, or irregular values and outliers in the dataset. In Oregon and Minnesota, aberrant risk values were evident in the 2006 seasonal peaks (Fig. 4). These data were not flagged by control charting and were not smoothed prior to the analysis. The presence of outliers, change points or interventions can alter patterns and invalidate forecasts. We believe our results in these states would have improved had 2006 followed the typical seasonal curvature. In addition, surveillance systems can underestimate actual disease risk, and reporting may vary between states. As a result, predictions based on surveillance data should be interpreted with caution.
Over the past 20 years, active modern surveillance systems implemented in developed nations have offered more accurate statistical prediction capacity than was previously possible [Reference Naumova29, Reference Myers32]. Risk data from surveillance systems can be modelled as a means of assessing associations between disease risk and epidemiological factors over time [Reference Hartnack10, Reference Myers32]. Detecting aberrant disease incidence can signal an impending epidemic [Reference Cardinal, Roy and Lambert31]. Currently, advanced software offers forecasting methods that are suitable for use by public health officials [Reference Nobre18, Reference Williamson and Weatherby Hudson27, Reference Rolfhamre and Ekdahl28]. These statistical computing techniques allow interdependence of observations in both time and space to be incorporated into epidemiological models. As a result, the temporal structure of risk data may assist epidemiologists in modelling biological, environmental and behavioural factors of disease with greater accuracy than the classical one-dimensional regression framework [Reference Singer and Willett47]. As demonstrated in this study, these techniques may provide health officials with practical, user-friendly and accurate predictive warning systems based solely on previous risk data [Reference Williamson and Weatherby Hudson27]. The models can be implemented and validated monthly for the practical purpose of predicting the risk of campylobacteriosis. This information may be useful for public health professionals in early epidemic alert systems, as well as adding to our knowledge of seasonal disease patterns over time.
ACKNOWLEDGEMENTS
We thank the Centers for Disease Control and Prevention FoodNet active surveillance for providing data for this study.
DECLARATION OF INTEREST
None.