Diet is a key modifiable risk factor, but the exploration of its role in disease occurrence is complicated because of methodological issues related to the dietary assessment method used( Reference Bingham, Luben and Welch 1 – Reference Willett 3 ), food and nutrient interactions( Reference Jacobs and Steffen 4 , Reference Messina, Lampe and Birt 5 ) and differences in food consumption across populations( Reference Irala-Estevez, Groth and Johansson 6 – Reference Teufel 8 ). Traditionally, nutritionists and researchers have explored the effect of individual dietary factors in disease occurrence. However, some authors advocate the use of dietary patterns instead of individual foods and nutrients, arguing that they may better capture variability in the population’s diet, while allowing the evaluation of interactions between dietary factors( Reference Barkoukis 9 – Reference Jacques and Tucker 11 ).
These patterns can be identified with data-driven methods such as principal component analysis (PCA), factor analysis (FA) and cluster analysis or can be represented by investigator-driven patterns known as dietary quality indices. Investigator-driven patterns assign a set of scores based on individuals’ fulfilment of a set of fixed recommendations. Therefore, they are widely applicable, facilitating the exploration of the reproducibility of their association with different diseases in independent populations( Reference George, Ballard-Barbash and Manson 12 – Reference Reedy, Krebs-Smith and Miller 16 ). However, they present the disadvantage of being very disease dependent, given that they are mainly based on existing evidence of the association between diet and CVD( Reference Fung, McCullough and Newby 17 ). On the other hand, data-driven dietary patterns are more representative of the diet of the specific population from which they have been extracted and independent of the diseases, but many authors argue that the patterns obtained are very population-dependent, and therefore difficult to reproduce in other settings( Reference Jacques and Tucker 11 , Reference Martinez, Marshall and Sechrest 18 , Reference Slattery and Boucher 19 ). The reproducibility of data-driven dietary patterns has been assessed previously by various authors using dietary information obtained with common assessment tools at different moments of time within the same sample( Reference Hu, Rimm and Smith-Warner 20 – Reference Newby, Weismayer and Akesson 23 ). However, no previous studies have explored the reproducibility of data-driven dietary patterns extracted from different samples.
The objective of this study was to assess the reproducibility of data-driven dietary patterns in different samples extracted from similar populations. We compared the results from a previous case–control study on diet and female breast cancer (BC) in Spain, the EpiGEICAM study (Epidemiological study of the Spanish Group for Breast Cancer Research; GEICAM: Grupo Español de Investigación en Cáncer de Mama)( Reference Castelló, Pollan and Buijsse 24 ), with those obtained from a sample of Spanish women attending BC screening programmes (Determinantes de la Densidad Mamográfica en España – Determinants of Mammographic Density in Spain (DDM-Spain)), by evaluating the correlation between pattern scores and the congruence between the composition of patterns in both populations.
Methods
Study population and data collection
We used information on three dietary patterns obtained from a previous case–control study on female BC (EpiGEICAM study) using the dietary intake data of 973 healthy participants, aged 22–71 years, and recruited from fourteen Spanish provinces during the period 2006–2011( Reference Castelló, Pollan and Buijsse 24 ). These patterns will be used as a reference to explore their reproducibility in a different sample using data from the DDM-Spain participants. DDM-Spain is a cross-sectional, multicentre study carried out in seven screening centres belonging to the Spanish BC screening network and located throughout the Spanish peninsula( Reference Lope, Perez-Gomez and Sanchez-Contador 25 , Reference Pollan, Lope and Miranda-Garcia 26 ). In Spain, all women aged 50–69 years (45–69 years in some regions), regardless of nationality or legal status, are invited to be screened under these government-sponsored programmes every 2 years. Women were randomly selected among all screening attendants and invited to participate on a daily basis until the minimum sample size of 500 for each centre was reached. A total of 3550 women were recruited between 2007 and 2008, with an average participation rate of 74·5 % (range 64·7–84·0 % across centres). Women were interviewed at the screening centres by trained interviewers who collected demographic, anthropometric, physical activity, gynaecologic, obstetric and occupational data, as well as family and personal history (including weight and height at age 18 years). Information on smoking included current status and months since quitting for ex-smokers. Current smokers were defined as women who smoked at the time of mammography or had quit <6 months before. Dietary intake during the preceding year was collected using a validated 117-item FFQ( Reference Vioque, Navarrete-Munoz and Gimenez-Monzo 27 , Reference Willett, Sampson and Stampfer 28 ). 
Postmenopausal status was defined as self-reported absence of menstruation in the previous 12 months. Interviewers measured weight, height and waist and hip circumferences twice using the same protocol and identical balance scales, stadiometers and measuring tapes. A third measure was taken when the first two were not equal.
The DDM-Spain study was conducted according to the guidelines laid down in the Declaration of Helsinki, and all procedures involving human subjects were approved by the bioethics and animal welfare committee at the Carlos III Institute of Health. All participants signed a consent form, including permission to publish results from the current research.
Dietary patterns
The FFQ used in both studies were designed to assess the whole diet, had similar structures and were based on a validated FFQ( Reference Vioque, Navarrete-Munoz and Gimenez-Monzo 27 , Reference Willett, Sampson and Stampfer 28 ). However, the FFQ of the DDM-Spain study included some additional food items that were not contained in the FFQ of the EpiGEICAM study( Reference Lope, Perez-Gomez and Sanchez-Contador 25 , Reference Pollan, Lope and Miranda-Garcia 26 ): the FFQ used in the EpiGEICAM study contained ninety-nine items, of which eighty-six were used to create the food groups (after excluding the non-energetic and alcoholic beverages), whereas the FFQ from DDM-Spain included 117 items (the same ninety-nine from EpiGEICAM plus eighteen additional foods), of which ninety-nine were used to create the food groups (after excluding non-energetic and alcoholic beverages). In both cases, the dietary information collected was grouped into the same twenty-six food groups, which are summarised in Table 1, where the items included only in the DDM-Spain study are shown in italics.
* Log-transformed intake in grams.
† Weighted within the high- and low-fat dairy product categories according to the consumption of whole, semi-skimmed and skimmed milk.
w1=whole/(whole+semi-skimmed+skimmed).
w2=(semi-skimmed+skimmed)/(whole+semi-skimmed+skimmed).
w1 and w2 were 0·5 if consumption was 0 g for whole, semi-skimmed and skimmed milk.
‡ The additional items included only in the FFQ from the Determinants of Mammographic Density in Spain study, which were not collected in the FFQ from the EpiGEICAM study, are shown in italics.
§ All the n-3-enriched milk brands that have been consulted are skimmed or semi-skimmed.
The EpiGEICAM study identified three dietary patterns over twenty-six food groups: a Western pattern characterised by elevated intakes of high-fat dairy products, processed meat, refined grains, sweets, energetic drinks and other convenience foods and sauces and by low intakes of low-fat dairy products and whole grains; a Prudent pattern defined by high intakes of low-fat dairy products, vegetables, fruits, whole grains and juices; and a Mediterranean pattern represented by a high intake of fish, vegetables, legumes, boiled potatoes, fruits, olives and vegetable oil and a low intake of juices. These patterns explained 16, 13 and 8 % of the total variability in food intake, respectively( Reference Castelló, Pollan and Buijsse 24 ). We assessed the reproducibility of these three patterns by comparing them with the patterns extracted by applying the same PCA analysis to the same twenty-six food groups from the DDM-Spain sample.
Statistical analysis
Major existing dietary patterns were identified in the DDM-Spain sample using the same technique applied to the EpiGEICAM data( Reference Castelló, Pollan and Buijsse 24 ): applying PCA without rotation to the variance–covariance matrix over twenty-six inter-correlated food groups that were reduced to a set of principal components (dietary patterns in this case). The first few components with eigenvalues >1 were selected for initial exploration. The PCA reports, for a given pattern, a set of weights associated with each food group (commonly called component/pattern weights) that is used to calculate pattern scores, defined, for each individual, as a weighted sum of the food group consumption. Afterwards, these scores were correlated with the food group consumption to calculate the pattern loadings, which indicate the importance of individual food groups in each pattern. Pattern weights and pattern loadings give similar information, except that they are measured on different scales (weights are standardised into Z score form)( Reference Burt 29 ). As only information on pattern loadings was provided by the EpiGEICAM study, these were used to compare dietary patterns from both studies. For comparison purposes, we considered that food groups with pattern loadings ≥|0·3| were the main contributors to a dietary pattern.
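The extraction step described above can be sketched in a few lines. The following is a minimal illustration only, not the code used in the study (which was run in Stata): it assumes NumPy, a hypothetical function name `pca_patterns`, and an input matrix with one row per woman and one column per food group. It returns the eigenvalues so that a selection rule such as the eigenvalue threshold mentioned above could be applied afterwards.

```python
import numpy as np

def pca_patterns(X, n_patterns=3):
    """Unrotated PCA on the variance-covariance matrix of food-group intakes.

    X is an (n_subjects, n_food_groups) array. The function name and the
    default of three patterns are illustrative assumptions.
    """
    Xc = X - X.mean(axis=0)                     # centre each food group
    cov = np.cov(Xc, rowvar=False)              # variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_patterns]
    eigvals, weights = eigvals[order], eigvecs[:, order]

    # Pattern scores: weighted sum of the centred food-group consumption.
    scores = Xc @ weights

    # Pattern loadings: correlation between each food group and each score.
    sd_x = Xc.std(axis=0, ddof=1)
    sd_s = scores.std(axis=0, ddof=1)
    loadings = (Xc.T @ scores) / (len(X) - 1) / np.outer(sd_x, sd_s)
    return eigvals, weights, scores, loadings
```

Food groups whose absolute loading reaches 0·3 in a given column of `loadings` would then be read as the main contributors to that pattern.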
To evaluate the level of agreement between the food composition of patterns extracted in the DDM-Spain study and those reported in the EpiGEICAM study, we calculated congruence coefficients (CC)( Reference Burt 29 , Reference Tucker 30 ) between the pattern loadings from both studies. CC represents the correlation between pattern loadings based on their deviations from 0 (instead of being based on the deviations from the mean of the factor loadings as the Pearson’s correlation is) and it is the preferred measure for component/factor similarity extracted with PCA/FA( Reference Haven and Berge 31 ). CC ranges from −1 to 1, and a value in the range 0·85–0·94 corresponds to fair similarity, whereas a value ≥0·95 implies that the two compared components/factors can be considered equivalent( Reference Haven and Berge 31 – Reference Nesselroade and Baltes 33 ).
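The CC between the pattern loadings of a given pattern from EpiGEICAM ($l_{1j}$) and the pattern loadings of a given pattern from DDM-Spain ($l_{2j}$) for each of the $j=1,\ldots,26$ food groups were calculated as follows:

$$\mathrm{CC} = \frac{\sum_{j=1}^{26} l_{1j}\, l_{2j}}{\sqrt{\left(\sum_{j=1}^{26} l_{1j}^{2}\right)\left(\sum_{j=1}^{26} l_{2j}^{2}\right)}}$$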
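As an illustration of this computation, the sketch below implements Tucker's congruence coefficient in its usual form, the uncentred cosine between two loading vectors, which matches the description of a correlation based on deviations from 0. The function names and the wording of the labels are assumptions made for this example; the cut-off points are those quoted in the text.

```python
import math

def congruence_coefficient(l1, l2):
    """Tucker's congruence coefficient between two vectors of pattern loadings."""
    if len(l1) != len(l2):
        raise ValueError("loading vectors must cover the same food groups")
    cross = sum(a * b for a, b in zip(l1, l2))
    norm = math.sqrt(sum(a * a for a in l1) * sum(b * b for b in l2))
    return cross / norm

def similarity_label(cc):
    """Interpret a CC with the cut-off points quoted in the text."""
    if cc >= 0.95:
        return "equivalent"
    if cc >= 0.85:
        return "fair similarity"
    return "below fair similarity"
```

For the loading vectors (0·5, 0·6, 0·7) and (0·7, 0·6, 0·5), Pearson's correlation is −1 because the deviations from each mean run in opposite directions, whereas the CC is about 0·96 because all loadings lie on the same side of 0.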
In addition, to follow the same methodology commonly used in studies exploring the reproducibility of dietary patterns, Spearman’s correlation coefficients (Corr) between the EpiGEICAM and the DDM-Spain pattern scores were calculated. For that purpose, pattern scores (which reflect the level of compliance of each woman with each one of the dietary patterns) were calculated as the linear combination of consumption of food groups weighted by the pattern loadings from EpiGEICAM Western, Prudent and Mediterranean patterns and from the set of selected patterns resulting from applying PCA to the DDM-Spain data as follows( Reference Schulze, Hoffmann and Kroke 34 ):

$$P_{ik} = \sum_{j=1}^{26} L_{jk}\, C_{ij}$$
where P is the pattern score, L the loading score, C the centred food consumption, k the Western, Prudent and Mediterranean patterns from EpiGEICAM and Western, Prudent and Mediterranean patterns from DDM-Spain, i=1, …, 3550 women and j=1, …, 26 food groups.
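The score calculation and the rank correlation between two sets of scores can be sketched as follows. This is a minimal, standard-library illustration with hypothetical function names. It takes the score to be the loading-weighted sum of centred consumption, as defined above, and computes Spearman's coefficient as Pearson's correlation of average ranks.

```python
import math

def pattern_scores(consumption, loadings):
    """Score per woman: sum over food groups of loading x centred consumption.

    `consumption` holds one row per woman and one column per food group;
    `loadings` holds one loading per food group for a single pattern.
    """
    n, p = len(consumption), len(loadings)
    means = [sum(row[j] for row in consumption) / n for j in range(p)]
    return [sum(loadings[j] * (row[j] - means[j]) for j in range(p))
            for row in consumption]

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's coefficient as Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applying `pattern_scores` twice to the same consumption data, once with the reference loadings and once with the newly extracted loadings, and passing both results to `spearman` gives the Corr between scores for one pattern.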
CC is the preferred measure of similarity for components/factors extracted with PCA/FA because its validity is supported by methodological research( Reference Haven and Berge 31 – Reference Nesselroade and Baltes 33 ). In addition, a recent study has questioned whether Pearson’s correlation coefficient (Corr) alone is adequate to assess pattern similarity( Reference Castello, Buijsse and Martin 35 ). However, the majority of studies exploring the reproducibility of dietary patterns base their conclusions on the latter measure, considering any significant correlation as being indicative of pattern similarity regardless of its value( Reference Hu, Rimm and Smith-Warner 20 – Reference Newby, Weismayer and Akesson 23 ). In this study, we provide the correlation coefficient for the sake of comparability with published data, but we base our final conclusion regarding pattern reproducibility on the CC.
To take into account sampling variability in the estimation of pattern loadings using DDM-Spain data, and subsequently in the estimation of the agreement measurements between the patterns identified within the EpiGEICAM and the DDM-Spain studies, we performed a non-parametric bootstrap estimation with 5000 replications. Using sampling with replacement, the bootstrap generated 5000 replicates of the original DDM-Spain data set. PCA was then applied in each replication, and the three principal components that proved to be most similar to those reported in the EpiGEICAM study were selected on the basis of the distance between the pattern loadings (more details are given in the online Supplementary Method 1). The 95 % percentile CI for each parameter was given by the 2·5th and 97·5th percentiles of the distribution of the 5000 bootstrap point estimates.
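A rough sketch of such a bootstrap is given below for a single reference pattern. It is not the procedure of Supplementary Method 1, whose details are not reproduced here. In particular, the use of Euclidean distance, the comparison against each component and its mirror image (the sign of a principal component is arbitrary), the number of candidate components and all function names are assumptions made for illustration, and NumPy stands in for the software actually used.

```python
import numpy as np

def loadings_from_pca(X, n_components):
    """Loadings (food group x component correlations) from covariance PCA."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = Xc @ eigvecs[:, order]
    Z = Xc / Xc.std(axis=0, ddof=1)             # standardised food groups
    S = scores / scores.std(axis=0, ddof=1)     # standardised scores
    return Z.T @ S / (len(X) - 1)

def congruence(a, b):
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def closest_component(reference, candidates):
    """Candidate column, or its mirror image, nearest to the reference."""
    best, best_dist = None, np.inf
    for col in candidates.T:
        for signed in (col, -col):
            dist = np.linalg.norm(reference - signed)   # assumed distance
            if dist < best_dist:
                best, best_dist = signed, dist
    return best

def bootstrap_cc_ci(X, reference, n_rep=5000, n_components=5, seed=0):
    """95 % percentile CI of the CC between `reference` and its best match."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ccs = np.empty(n_rep)
    for b in range(n_rep):
        sample = X[rng.integers(0, n, size=n)]  # resample rows with replacement
        candidates = loadings_from_pca(sample, n_components)
        ccs[b] = congruence(reference, closest_component(reference, candidates))
    return np.percentile(ccs, [2.5, 97.5])
```

Here `reference` would hold the twenty-six published loadings of one EpiGEICAM pattern and `X` the DDM-Spain food-group intakes; the call is repeated for each of the three reference patterns.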
Similar analyses were carried out by applying the PCA to food groups from the DDM-Spain study, which included the same exact eighty-six items considered in the EpiGEICAM analysis (online Supplementary Table S1 and Fig. S1).
Analyses were performed using STATA/MP 14.0.
Results
The anthropometric, reproductive and socio-demographic characteristics of the EpiGEICAM controls( Reference Castelló, Pollan and Buijsse 24 ) and DDM-Spain women are summarised in Table 2. The DDM-Spain study recruited a higher percentage of older and postmenopausal women (77 v. 47 %), women with higher energy intake (on average 656 kJ/d (157 kcal/d) more in the DDM-Spain group), women with higher BMI and a higher percentage of women who practised physical activity with moderate-to-vigorous intensity (76 v. 63 %). On the other hand, these women reported lower intake of alcohol, lower educational level (34 % with primary school or less in DDM and 16 % in EpiGEICAM), lower percentage of family history of BC (7 v. 20 %), lower age at first delivery (43 % of parous women in the DDM had their first child before 25 years of age, whereas this proportion was 26 % in EpiGEICAM) and there was a lower percentage of nulliparous (9 v. 23 %) women. The distribution of age at menarche and smoking appeared to be fairly similar in both studies.
v.e., Total variability in food group intakes explained by the pattern; BC, breast cancer.
* Descriptive data extracted from the scientific article of Castello et al.( Reference Castelló, Pollan and Buijsse 24 ).
† As distribution of the prudent score was skewed, the median and IQR were used to describe this score.
Fig. 1–3 show the comparison between the original loadings from the EpiGEICAM study with their corresponding values in the DDM-Spain study. Western patterns from both studies were characterised by high intakes of high-fat dairy products, refined grains, energetic drinks and convenience food and sauces and low intakes of low-fat dairy products and whole grains. Correlations with the intake of red and/or processed meat and with sweets were also close to the 0·3 threshold. Moreover, the DDM-Spain Western pattern seemed to be negatively correlated with the consumption of white fish, a result that was not observed in EpiGEICAM. Despite these small differences, the elevated CC between patterns (CC=0·90) indicates a fair similarity between the Western patterns extracted from the EpiGEICAM and the DDM-Spain data (Fig. 1).
We did not identify a pattern among women of the DDM-Spain study that was highly congruent with the EpiGEICAM Prudent pattern. The most similar pattern presented a high consumption of whole grains and juices but failed to correlate with low-fat dairy products, vegetables and fruits (Fig. 2). Something similar was observed with the Mediterranean pattern: several high correlations were observed with some vegetables, legumes, potatoes and nuts. However, the pattern from the DDM-Spain study did not include other typical factors of the Mediterranean diet, such as fish, olive oil and fruits (even if pattern loadings for these food groups were not low), whereas other foods more common in the Prudent diet, such as low-fat dairy products, or in the Western diet, such as sweets, and sugary and convenience foods, were included with high correlations. According to the CC (0·77), the EpiGEICAM and the DDM-Spain Mediterranean patterns cannot be considered similar (Fig. 3).
Finally, had we considered any significant correlation as being indicative of similarity, we would have concluded that all patterns extracted from the EpiGEICAM data were reproducible in the DDM-Spain study.
Discussion
To the best of our knowledge, this is the first study exploring the reproducibility of data-driven patterns in two different samples extracted from similar populations. We were able to reproduce the Western pattern identified in women from the EpiGEICAM study among women attending BC screening programmes who participated in the DDM-Spain study. However, the reproducibility of the Prudent and Mediterranean patterns cannot be considered good.
The association between dietary patterns and BC has been explored in many studies in different settings. Most of these studies identified a Western/Unhealthy pattern, which shares the most important characteristics with the Western patterns identified in EpiGEICAM and DDM-Spain, such as high consumption of fatty dairy products, red/processed meat, refined grains, sweets and convenience foods( Reference Agurs-Collins, Rosenberg and Makambi 36 – Reference Wu, Yu and Tseng 41 ). However, the Mediterranean and Prudent patterns have often been mixed under the names of Vegetable, Prudent, Healthy or Mediterranean diet. These patterns are characterised by a high consumption of vegetables and fruits( Reference Agurs-Collins, Rosenberg and Makambi 36 – Reference Zhang, Ho and Fu 47 ) that are an important part of the Mediterranean diet, but fail to include other items such as olive oil( Reference Agurs-Collins, Rosenberg and Makambi 36 , Reference Cui, Dai and Tseng 38 – Reference Wu, Yu and Tseng 41 , Reference De Stefani, Deneo-Pellegrini and Boffetta 44 – Reference Zhang, Ho and Fu 47 ), nuts( Reference Agurs-Collins, Rosenberg and Makambi 36 – Reference Wu, Yu and Tseng 41 , Reference Bessaoud, Tretarre and Daures 43 – Reference Zhang, Ho and Fu 47 ), legumes( Reference Cottet, Touvier and Fournier 37 , Reference Terry, Suzuki and Hu 39 – Reference Wu, Yu and Tseng 41 , Reference De Stefani, Deneo-Pellegrini and Boffetta 44 , Reference Hirose, Matsuo and Iwata 46 , Reference Zhang, Ho and Fu 47 ) or fish( Reference Cui, Dai and Tseng 38 , Reference Wu, Yu and Tseng 41 ), which are key foods to differentiate the so-called Prudent or Healthy patterns from the Mediterranean.
None of the above-mentioned studies has been able to identify both a Prudent and a Mediterranean pattern in the same population, probably reflecting the difficulty in differentiating them in contexts where the Mediterranean diet is not very prevalent. On the other hand, the higher agreement in the definition of a Western pattern across studies is consistent with the greater reproducibility of this pattern observed in our study.
As noted earlier in this study, PCA reduces a set of inter-correlated variables to a group of principal components (dietary patterns in this case) so that the maximum correlation between the variables within components and the minimum correlation among components are obtained( Reference Rencher 48 ). Therefore, the greater the variability in diet, the easier it will be to find clearly differentiated independent patterns. In our study, although EpiGEICAM included women from fourteen Spanish provinces (four of them on the Mediterranean coast), DDM-Spain participants were recruited from screening centres located in seven provinces (three of them located on the Mediterranean coast). Therefore, the greater geographical distribution in the EpiGEICAM study may imply a greater representativeness of all diets across the Spanish territory. In addition, distribution of age among DDM-Spain women was more homogeneous (range=45–69) than that observed in the EpiGEICAM participants (range=22–71). As García-Arenzana et al.( Reference García-Arenzana, Navarrete-Munoz and Peris 49 ) previously described, older women tend to have healthier dietary habits than younger women, which may have produced a more heterogeneous distribution of dietary habits in the EpiGEICAM study. This heterogeneity might have facilitated the identification of more specific patterns, not only limited to the discrimination of two antagonistic patterns (Western v. Healthy/Prudent/Mediterranean) but also allowing the clear differentiation of patterns with subtle differences, such as the Prudent and Mediterranean patterns.
Regarding the pre-established thresholds for the CC that define the similarity of dietary patterns in both studies, we based our decision on three published pieces of research that evaluated congruence coefficients in light of the subjective opinion of several experienced researchers judging the equivalence between different components( Reference Haven and Berge 31 – Reference Nesselroade and Baltes 33 ). Haven and Nesselroade( Reference Haven and Berge 31 , Reference Nesselroade and Baltes 33 ) argue that values over 0·80 are enough to assume fair similarity between components, whereas Lorenzo-Seva & Berge( Reference Lorenzo-Seva and Berge 32 ) maintain a more conservative approach, setting the cut-off point for fair similarity at 0·85 and preventing a CC below this value from being interpreted as indicative of similarity. All three articles agree on the difficulty in setting a cut-off point below which patterns should be considered clearly different. Although the CC is considered a good measure of agreement between components or factors extracted with PCA or FA( Reference Haven and Berge 31 – Reference Nesselroade and Baltes 33 ), the existing literature evaluating the reproducibility of data-driven dietary patterns does not use this measure and bases its conclusions only on the correlations between pattern scores, considering any significant correlation as being indicative of similarity regardless of its value( Reference Hu, Rimm and Smith-Warner 20 – Reference Newby, Weismayer and Akesson 23 ), which can be as low as 0·27( Reference Newby, Weismayer and Akesson 23 ). In our case, the correlations were significant and high for all three patterns (Fig. 1–3).
However, according to the CC, only the Western pattern can be considered fairly similar between studies, which highlights the arbitrariness of the significance of the linear correlation to define pattern similarity and the need to choose an appropriate measure and a concrete threshold for such a measure to determine the level of congruence between patterns. In this regard, we have recently explored the applicability of previously reported dietary patterns in a different setting and we found that, for CC between pattern loadings ≥0·82 or correlations between pattern scores ≥0·57, patterns not only appear to have a very similar composition but also are similarly associated with BC risk( Reference Castello, Buijsse and Martin 35 ). The same direction of the associations but loss of significance was observed for values of the CC between pattern loadings ≤0·77 and values of the correlation between pattern scores ≤0·52. In the present study, taking into account only the methodological studies published regarding the threshold of the CC for pattern similarity( Reference Haven and Berge 31 – Reference Nesselroade and Baltes 33 ), we followed the most conservative approach and considered dietary patterns to be fairly similar if CC values were ≥0·85.
A major limitation of the use of dietary patterns is the potential for subjective interpretations by the investigator to be introduced at various stages of the dietary patterns’ construction. Subjective decisions that might affect the comparability between studies are as follows: which foods should be included in each of the defined groups, the thresholds chosen to determine the contribution of food groups to the identified dietary patterns and the assignment of a label to each of these patterns( Reference Barkoukis 9 – Reference Jacques and Tucker 11 , Reference Martinez, Marshall and Sechrest 18 , Reference Slattery and Boucher 19 ). However, we have previously demonstrated that this limitation can be overcome by a detailed analysis when comprehensive information on food grouping and loadings is provided( Reference Castello, Buijsse and Martin 35 ). In the present case, the FFQ from EpiGEICAM and DDM-Spain collected information on ninety-nine identical foods, the only difference being that DDM-Spain included eighteen additional foods that were not included in EpiGEICAM. In addition, the same group of researchers took principal responsibility for the analysis of the data; therefore, food grouping and labelling were very similar in both studies.
Finally, we summarise the main strengths of the present study. As previously mentioned, various studies have assessed the reproducibility of investigator-driven patterns( Reference George, Ballard-Barbash and Manson 12 – Reference Reedy, Krebs-Smith and Miller 16 ). The reproducibility of data-driven dietary patterns extracted from the same sample using the dietary information obtained with different assessment tools or at different time points( Reference Hu, Rimm and Smith-Warner 20 – Reference Newby, Weismayer and Akesson 23 ) has also been explored. However, to our knowledge, this is the first study assessing the reproducibility of data-driven dietary patterns in different samples from similar populations and the first using the CC to evaluate their similarity. In addition, most of the published studies on reproducibility of data-driven dietary patterns based their conclusions on limited sample sizes that ranged from 124 to 498( Reference Hu, Rimm and Smith-Warner 20 – Reference Nanri, Shimazu and Ishihara 22 ). Dietary patterns from EpiGEICAM were extracted from 973 healthy women, and for DDM-Spain the sample size was 3550, a size only exceeded by the Newby et al. study( Reference Newby, Weismayer and Akesson 23 ).
Conclusions
The reproducibility of widely prevalent dietary patterns such as the Western pattern is better than the reproducibility of patterns more specific to certain populations, such as the Mediterranean. More methodological studies exploring the reproducibility of dietary patterns are needed to establish a more objective threshold for the CC between pattern loadings and their equivalent Corr between pattern scores that define pattern similarity.
Acknowledgements
The authors thank the DDM-Spain study participants for their contribution to breast cancer research and all collaborator researchers: Pilar Moreo, María Pilar Moreno, María Soledad Abad, Francisca Collado, Francisco Casanova, Jose Antonio Vázquez, Nieves Ascunce, Milagros García, Manuela Alcaraz, María Soledad Laso, Josefa Miranda and Francisco Ruiz Perales.
This study was supported by Carlos III Institute of Health FIS (Spanish Public Health Research Fund: PI060386 FIS; PS09/00790 and PI15CIII/0029 research grants), the Spanish Ministry of Health (EC11-273), the Spanish Ministry of Economy and Competitiveness (IJCI-2014-20900), the Spanish Federation of Breast Cancer Patients (FECMA: EPY 1169-10) and the Association of Women with Breast Cancer from Elche (AMACMEC: EPY 1394/15). None of the funders had any role in the design, analysis or writing of this article.
V. L., N. A., B. P.-G. and M. P. designed the study; A. C., J. V., C. S., C. P.-P., S. A., M. E., D. S.-T., C. V. and C. S.-C. collected the data and/or prepared the database. A. C. performed statistical analysis and wrote the initial version of the manuscript that M. P. revised and corrected in its different versions. All the authors have read and approved the final version of the manuscript.
The authors declare that there are no conflicts of interest.
Supplementary material
For supplementary material/s referred to in this article, please visit http://dx.doi.org/10.1017/S000711451600252X