INTRODUCTION
Salmonella is a major cause of human gastroenteritis and bacteraemia in both the industrialized and developing world, with most cases being foodborne infections [Reference Domingues1, Reference Pires and Hald2]. Although Salmonella serovars Typhimurium and Enteritidis are the most common causes of human salmonellosis worldwide, other serovars have been reported to be more prevalent in some regions, e.g. S. Stanley and S. Weltevreden, which are very common in South East Asia [Reference Aarestrup3–Reference Herikstad, Motarjemi and Tauxe5].
In a globalized world with constant movement of people, animals and goods, the importance of monitoring infectious diseases is growing. Alert systems, able to detect changes in disease trends and risk factors as soon as possible, are essential to allow quick control measures and prevent outbreaks and epidemics [Reference Reintjes, Baumeister and Coulombier6, Reference Zankari7]. One of the biggest challenges when developing these tools is how to handle data limitations. Surveillance datasets are often hampered by the lack of sufficient quality data and confounded by multiple human and environmental interactions.
Determining the sources of salmonellosis infections is often done retrospectively using clinical case information. To develop efficient methods for prospective surveillance it is fundamental not only to understand the mechanisms of infection development and spread but also to survey the more probable sources of infection. However, monitoring of the frequent reservoirs for Salmonella is, in most countries, based only on sporadic sample collections rather than on focused, structured surveillance programmes.
Statistical scans are a mathematical tool commonly used in outbreak detection and evaluation from cancer to infectious diseases including Salmonella [Reference Kulldorff8–Reference Alkhamis11]. These tools scan the data looking for unexpected cases, patterns or trends in spatial, temporal and also spatio-temporal dimensions.
The main objective of this study was to investigate spatio-temporal clusters using statistical scan methods on routine monitoring data of Salmonella infections. The dataset comprised isolate data from human and non-human sources in Thailand from 2002 to 2007.
A second objective was to evaluate the existence of serovar-specific associations between human and non-human clusters. This type of information could be useful in complementing incomplete datasets and helping the establishment of effective surveillance systems.
MATERIAL AND METHODS
Data
In Thailand, the monitoring of Salmonella infections is based on a passive surveillance scheme where the National Institute of Health (NIH) – Salmonella and Shigella Section, receives the clinical isolates suspected to be Salmonella from the diagnostic laboratories, hospitals and medical clinics across the country. For each confirmed case, the relevant clinical and epidemiological information is recorded [Reference Bangtrakulnonth12].
Data on Salmonella from other sources (e.g. animals or food) is based on sporadic and ad-hoc sampling schemes implemented by the Thai authorities [Reference Bangtrakulnonth12].
For this study, we used a dataset containing data on 29 586 Salmonella isolates collected during the period 2002–2007 from both human and other sources. Besides date, location and source of the sample, the isolate's serotype was also registered.
Database management, descriptive statistics and data arrangement were performed in SAS Enterprise Guide 3.0 (SAS Institute, USA).
Spatio-temporal scan statistics
The cluster analysis was performed using the spatio-temporal statistical scan methods available in SaTScan™ v. 9.0.1 (Information Management Services Inc., USA) [Reference Kulldorff13].
Scan statistics analyses are usually done by moving a scanning window (with different possible shapes and sizes) through the space and time dimensions of the data. For each location and window size, the observed and expected numbers of cases are compared, and any unexpected excess in the number of observed cases is recorded. The statistical significance of each potential cluster is then evaluated [Reference Kulldorff8, Reference Kulldorff13]. Each cluster is determined independently, so that not only the most likely cluster (MLC) but also all other statistically significant clusters are detected [Reference Kulldorff13, Reference Madin14]. By doing a simultaneous spatial and temporal analysis it is possible to detect clusters that would not be apparent if looking at only one dimension. Spatio-temporal clusters can capture the mechanisms of infection spread much more realistically [Reference Bangtrakulnonth15].
For the human isolates, a retrospective space–time permutation scan statistic method was used. This method uses only case numbers, and does not require data on the background population at risk [Reference Kulldorff16]. It requires only minimal assumptions about time and geographical location and has the advantage of adjusting automatically for natural purely spatial or purely temporal variation (e.g. seasonal variation) [Reference Kulldorff8]. A cluster is detected in a region if, during a specific time interval, that region has a higher proportion of excess cases, or a smaller deficit of cases, than the neighbouring regions [Reference Kulldorff16]. The analyses were serovar-specific and performed by year (using a month as the unit of observation).
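The idea behind the space–time permutation model can be illustrated with a minimal sketch. It assumes the standard formulation of the method, in which the expected count for a zone and month is the product of that zone's total and that month's total divided by the overall number of cases, and a candidate cylinder is scored with a log generalized likelihood ratio. The function names and the small zone-by-month table are hypothetical and are not taken from the study data or from SaTScan itself.

```python
import math

def expected_counts(cases):
    """Expected cases per zone and month, using case counts only.

    cases: list of rows, one row per zone, one column per month.
    """
    total = sum(sum(row) for row in cases)
    zone_totals = [sum(row) for row in cases]
    month_totals = [sum(col) for col in zip(*cases)]
    # mu[z][m] = (zone total * month total) / overall total
    return [[z * m / total for m in month_totals] for z in zone_totals]

def log_glr(cases, zones, months):
    """Log likelihood ratio for the cylinder covering `zones` x `months`.

    Returns 0.0 when the cylinder holds no excess of cases.
    """
    total = sum(sum(row) for row in cases)
    mu = expected_counts(cases)
    observed = sum(cases[z][m] for z in zones for m in months)
    expected = sum(mu[z][m] for z in zones for m in months)
    if observed <= expected:
        return 0.0
    value = observed * math.log(observed / expected)
    if total > observed:
        value += (total - observed) * math.log(
            (total - observed) / (total - expected)
        )
    return value

# Hypothetical counts: 2 zones (rows) x 2 months (columns).
example = [[10, 2], [3, 5]]
print(expected_counts(example)[0][0])
print(log_glr(example, zones=[0], months=[0]))
```

Because the expected counts are built from the marginal totals, a zone that always reports more isolates, or a month with a general seasonal rise, does not by itself produce an excess; only an interaction between place and time does.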
A Bernoulli retrospective scan statistic was used for the non-human isolates. This model does not depend on the underlying population; it uses the data grouped as cases vs. controls to determine whether the distribution of cases is significantly clustered compared with the distribution of controls [Reference Hyder10, Reference Warden17].
The analyses were per serovar, so the isolates from each specific serovar were defined as cases, while the isolates from the remaining serovars were defined as controls.
As for the human isolates, the analyses were performed by year (using a month as the unit of observation).
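The case vs. control comparison in the Bernoulli model can be sketched as follows. This is an illustration under the usual formulation of the Bernoulli scan statistic, where the proportion of cases inside a window is compared with the proportion outside it. The function name `bernoulli_llr` and the counts in the usage lines are hypothetical; here a "case" is an isolate of the serovar under analysis and a "control" is an isolate of any other serovar.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def _max_loglik(k, m):
    """Binomial log likelihood of k cases in m isolates at the observed rate."""
    p = k / m
    return _xlogy(k, p) + _xlogy(m - k, 1 - p)

def bernoulli_llr(c, n, C, N):
    """Log likelihood ratio for one candidate window.

    c, n: cases and total isolates (cases + controls) inside the window.
    C, N: cases and total isolates in the whole dataset.
    Returns 0.0 unless the case proportion is higher inside than outside.
    """
    if n == 0 or n == N:
        return 0.0
    if c / n <= (C - c) / (N - n):
        return 0.0
    return _max_loglik(c, n) + _max_loglik(C - c, N - n) - _max_loglik(C, N)

# Hypothetical window: 8 of 10 isolates are the serovar of interest,
# against 10 of 40 isolates in the whole year.
print(bernoulli_llr(8, 10, 10, 40))
```

Since the statistic depends only on proportions of cases among sampled isolates, a zone or month that was sampled more heavily does not automatically score higher, which is why the model suits unevenly collected data.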
Other pre-defined settings common to both methods were: a scanning window with an elliptical shape, a maximum spatial cluster size of 50% of the population at risk, a maximum temporal cluster size of 50% of the study period (6 months), and no restrictions on reporting secondary clusters. Significance was assessed using 999 Monte Carlo simulations for each analysis, and a cluster was considered significant if the P value for its calculated likelihood was <0·05.
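The way a P value follows from the Monte Carlo replicates can be sketched briefly. The snippet assumes the standard rank-based definition, in which the observed statistic is ranked among the simulated ones and the rank is divided by the number of replicates plus one. The function name and the example values are hypothetical.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank-based Monte Carlo P value.

    observed: test statistic of the cluster in the real data.
    simulated: maximum test statistics from the random replicates.
    """
    # The observed dataset counts as one of the (replicates + 1) datasets.
    rank = 1 + sum(1 for value in simulated if value >= observed)
    return rank / (len(simulated) + 1)

# Hypothetical example with 999 replicates, none exceeding the observed value.
print(monte_carlo_p_value(10.0, [0.0] * 999))
```

With 999 replicates the smallest attainable P value is 0·001, and a cluster falls below the 0·05 threshold only if at most 48 replicates reach or exceed its observed statistic.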
Besides the MLC, all significant secondary clusters were reported in the tool output. Secondary clusters with no overlap with the MLC were considered in the results. In contrast, secondary clusters that were just variations of the MLC were discarded, as they only reflect some uncertainty about the exact boundaries (either in time or space) of the MLC.
The existence of a possible association between a human and a non-human cluster was also evaluated. We considered two clusters possibly associated if they (i) overlapped in both time and space, (ii) overlapped in only one dimension (either time or space) and were adjacent in the other, or (iii) were adjacent in both time and space.
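The association rule can be expressed compactly: in each dimension two clusters either overlap, are adjacent, or are unrelated, and they are possibly associated when neither dimension is unrelated. The sketch below is only an illustration under stated assumptions. The `Cluster` type, the zone neighbour map and the reading of "adjacent in time" as consecutive months within the same year are hypothetical choices, not details given in the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    zones: frozenset   # administrative zones covered by the cluster
    start_month: int   # first month of the cluster (1-12)
    end_month: int     # last month of the cluster (1-12)

def time_relation(a, b):
    """Classify two clusters as overlapping, adjacent or unrelated in time."""
    if a.start_month <= b.end_month and b.start_month <= a.end_month:
        return "overlap"
    if a.end_month + 1 == b.start_month or b.end_month + 1 == a.start_month:
        return "adjacent"
    return "none"

def space_relation(a, b, neighbours):
    """Classify two clusters in space; `neighbours` maps a zone to the set
    of zones that border it (assumed symmetric)."""
    if a.zones & b.zones:
        return "overlap"
    for zone in a.zones:
        if neighbours.get(zone, frozenset()) & b.zones:
            return "adjacent"
    return "none"

def possibly_associated(a, b, neighbours):
    """True when the clusters overlap or are adjacent in both dimensions."""
    return (time_relation(a, b) != "none"
            and space_relation(a, b, neighbours) != "none")

# Hypothetical map: zone 1 borders zone 2, and zone 2 borders zone 3.
neighbours = {1: {2}, 2: {1, 3}, 3: {2}}
human = Cluster(frozenset({1}), start_month=3, end_month=4)
non_human = Cluster(frozenset({2}), start_month=5, end_month=6)
print(possibly_associated(human, non_human, neighbours))
```

Treating the two dimensions separately covers all three conditions (overlap in both, overlap in one and adjacency in the other, adjacency in both) with a single check.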
The geographical information system ArcMap 9.0 (Environmental Systems Research Institute, USA) was used for providing spatial coordinates and for visualizing the clusters. The Google charts API tool (Google Inc., USA) was also used for cluster visualization.
RESULTS
Data description
The 29 586 isolates were collected from different sources. The human isolates accounted for 65% of the total, isolates from food sources represented 19% and animal sources 7%. Isolates from a non-descriptive ‘other’ source (e.g. environment) were 9% of the data.
Apart from the human cases, the data collected from other sources had too few isolates in each subcategory (e.g. food, animal, environmental or other) to allow a reliable analysis, so the data was aggregated into a generic non-human data category. A total of 194 and 177 different serovars were identified for the human and non-human categories, respectively (results not shown).
The data distribution throughout the study period was not uniform: 28% of the isolates were from 2002, 13% from 2003, 18% from 2004, 18% from 2005, 13% from 2006 and 11% from 2007. If the analyses had been performed for the complete study period as a whole, the results would have been driven by the year 2002, hiding possible clusters occurring later. The study period was therefore divided into years and the analyses were run separately.
Thailand is organized into four main geographical regions: Central, Northeastern, Northern and Southern. These regions are further divided into 13 administrative zones (zones 1–12 and Bangkok), which are in turn organized into 76 provinces [Reference Bangtrakulnonth12, Reference Bangtrakulnonth and Tishyadhigama18, 19]. The dataset contained data collected within 55 of the 76 provinces. Bangkok was the most represented province (34% of the isolates), followed by Ratchaburi (9%), Nonthaburi (8%), Khon Kaen (8%) and Chiang Mai (6%); the remaining provinces each represented <5% of the isolates. The low number of isolates in some provinces was insufficient to allow an analysis at province level, so the analysis was done at the zone level instead. Figure 1 shows a map of Thailand highlighting the 13 different administrative zones.
For each year, the top five most common serovars in the human category were selected for analysis (Table 1). S. Enteritidis, S. Stanley and S. Rissen were the most common serovars, being present in all 6 years analysed, followed by S. Weltevreden (5 years), S. Anatum (3 years), S. Choleraesuis (2 years), S. Corvallis (1 year) and S. Typhimurium (1 year). However, in three instances, S. Choleraesuis (2006 and 2007) and S. Enteritidis (2007), there were not enough isolates in the non-human dataset to run the analysis, so the isolates of S. Corvallis (2006) and S. Weltevreden and S. Anatum (2007) were used instead.
Values in parentheses are number of isolates.
For three of the serovars (indicated by *) there were not enough isolates in the non-human dataset to run the analysis, therefore the analysis was run for the next serovar in the list that had enough available data.
Spatio-temporal scan statistics
A total of 91 human (involving 11% of the total human isolates) and 39 non-human (involving 16% of the total non-human isolates) significant spatio-temporal clusters were found distributed throughout the 6 years of data. A summary of the results per year and serovar is shown in Table 2, while a complete description of the detected clusters for human and non-human isolates can be found in Tables 3 and 4, respectively. Figures 2 and 3 illustrate the clusters detected in 2003 for both categories.
For the number representing the geographical zones, refer to the key in Figure 1.
In the results per year, for the human clusters, the number of detected clusters ranged from nine (2003) to 19 (2004). For the non-human clusters, the numbers were very similar between the years.
Looking at the results by serovar, the most clusters were detected for S. Rissen in both the human (21 clusters) and non-human (nine clusters) categories, followed by S. Weltevreden and S. Stanley, each with 16 clusters in the human category and eight in the non-human category.
DISCUSSION
This study presents a retrospective spatio-temporal statistical scan analysis of Salmonella isolates in Thailand during the period 2002–2007. The objectives of the analysis were to evaluate the existence of significant clusters and possible relationships between the human and non-human clusters. We also discuss the usefulness of spatio-temporal statistical tools in analysing data with limited epidemiological information.
The need for efficient early-alert detection systems for infectious diseases is growing. Public health authorities have to be able to quickly assess a possible outbreak and take the appropriate actions to control it.
Mathematical methods that can quickly scan the data looking for unexpected patterns in time and space, while adjusting for known covariates or risk factors (e.g. climate or, for foodborne bacteria, consumption habits), are fundamental in prospective surveillance systems.
In this study, we looked not only at clinical data but also at data collected from common reservoirs for Salmonella.
For the analysis of human isolates, a space–time permutation model was used. The fact that the method does not require data on population at risk makes it a good choice when facing data limitations. Moreover, because it can adjust automatically for purely geographical variations (e.g. population density) and purely temporal variations (e.g. seasonal patterns) [Reference Stelling20], it reduces the effect of possible data bias. The scan detected 91 significant clusters involving 11% of the total number of cases reported. The percentage of total reported salmonellosis cases associated with outbreaks has been estimated to be between 5 and 10% in New Zealand and in Europe [21, Reference King, Lake and Campbell22]. It was not possible to further investigate each cluster to evaluate if it corresponded to a true outbreak, but the results tend to agree with previous findings.
As an example, for S. Weltevreden, most of the clusters detected occurred in coastal regions (see cluster ID nos. 15, 16, 42, 43, 57 in Table 3). This is in agreement with Hendriksen et al. [Reference Hendriksen23], who describe S. Weltevreden as being commonly associated with seafood. Moreover, for S. Rissen and S. Stanley, the results show more clusters occurring in the central part of Thailand, which is a region associated with agriculture and more specifically intensive pig farming. Once again, this is in agreement with Hendriksen et al. [Reference Hendriksen24], who report that these serovars are frequently associated with pigs.
The data collection for the non-human category was mostly based on specific monitoring initiatives rather than on an established surveillance programme. Thus, there is both over- and under-representation of regions, periods of the year and reservoirs sampled in the dataset. The Bernoulli statistical method handles the data in two separate sets: cases and controls. By defining as cases the isolates from the serovar chosen for analysis, and as controls the rest of the isolates collected in the same year, the bias in the representativeness of the data is minimized.
The choice of serovars to be analysed was based on the most frequent serovars isolated in humans per year. For years 2002–2005 (results not shown) the most common serovars found in humans were also among those most common in the non-human category. However, in 2006 for S. Choleraesuis, and in 2007 for S. Choleraesuis and S. Enteritidis, there were not enough non-human isolates to perform the scan, even though these serovars were among the most frequent in humans. This could be due to changes in the reservoirs sampled (e.g. not collecting samples from eggs in 2007, which would be the most common reservoir for S. Enteritidis), or to true decreases in prevalence combined, for instance, with increases in the consumption of imported foods that were not sampled.
Statistical scan methods, such as those presented here, have been frequently used to identify clusters of disease, both infectious and chronic [Reference Kulldorff8, Reference Hyder10, Reference Alkhamis11, Reference Warden17, Reference Stelling20, Reference Coleman25–Reference Abe, Martin and Roche27].
The results showed that the methods performed well in detecting significant clusters and handling data limitations. Both the space–time permutation model and the Bernoulli model detected not only the most likely cluster but also two or three significant secondary clusters for many of the serovars. This suggests that the models are sensitive enough to detect all possible significant clusters according to the defined settings of likelihood. However, the scanning window used was limited to an elliptical cylinder shape (even if with various sizes), which could result in some clusters not being detected when scanning large geographical regions like Thailand [Reference Kulldorff8].
Data on surveillance of the most common reservoirs for Salmonella can provide useful insight for the prevention or early detection of human outbreaks and should be handled together with clinical data. In this analysis, we evaluated cases where there was a geographical and temporal adjacency or overlap between human and non-human clusters for the same serovar.
In the cases where an overlap existed, the information from the non-human clusters could help in tracking the source of the outbreak. Furthermore, in cases where a non-human cluster occurred before the human cluster or in its bordering regions, the information could be used to prevent human outbreaks from occurring. These types of association and information will become more reliable and significant as the data quality improves.
The associations detected were especially pronounced for S. Rissen. This serovar has often been linked with human infections through the consumption of pork products [Reference Hendriksen23], which are very popular in Thai cuisine. More details on the source of the S. Rissen isolates could confirm the association detected.
Regarding other possible biases in the data collection, the decrease in the number of isolates throughout the study period could be explained by a real decrease in the number of cases due to surveillance changes and improvements, but it could also reflect changes in the data collection process. Demographic differences should also be taken into account, e.g. the fact that Bangkok and its bordering regions are much more populated than the northern part of Thailand, or that access to medical facilities is more difficult in rural areas than in the cities. Similarly, the main economic activity of each region influences the number of submitted isolates; for instance, coastal areas have more tourism, and during seasonal peaks population numbers can increase markedly.
The amount and distribution of available data forced the analysis to be run with the isolates from non-human sources aggregated into one generic group. A more complete and systematic data collection would allow the analysis to be run by specific source (food, animal, or in even more detail, such as food type or animal species) and would result in more definitive conclusions about the possible associations detected. Nevertheless, these types of data restrictions reflect the reality for most countries, where a systematic and integrated surveillance system is not yet in place. It is important to develop methods that can work within these limitations and still have a reasonable predictive ability.
Further epidemiological investigations are needed to determine whether clusters represent real outbreaks or if they are a result of a temporary increase in the prevalence of endemic strains. Still, a significant cluster shows that some changes have occurred either in reservoirs or in the human population and this may require appropriate action by the relevant authorities. The observed temporal association between clusters appearing in non-human reservoirs prior to cases in humans could be used as warnings for the authorities and allow preventive actions to be taken before cases occur in humans.
The statistical analysis was done retrospectively using data from 2002 to 2007, but it still provides important indications on how the tools work and how limitations, such as the low amount of data collected and the fact that it is geographically unevenly dispersed over a large territory like Thailand, are handled. This can be useful for future adaptations of the methods to work in real-time and function as an outbreak detection tool.
In this study, spatio-temporal scan statistics proved to be an efficient and user-friendly approach for running a retrospective cluster analysis. The SaTScan software comprises different statistical methods that can be suited to different types of data and limitations, as shown in this study. The use of approaches like the one presented here could provide methodological support for the implementation of efficient strategies to control and prevent Salmonella infections.
ACKNOWLEDGEMENTS
This study was supported by the Center for Genomic Epidemiology (09-067103/DSF) (www.genomicepidemiology.org).
DECLARATION OF INTEREST
None.