FFQ are commonly used in large-scale nutritional epidemiology studies, but some FFQ do not have questions about portion sizes( Reference Osler, Heitmann and Gerdes 1 – Reference Bazzano, He and Ogden 3 ). Details concerning portion sizes or missing portion size values are rarely accounted for in scientific publications, but when calculating the dietary intake from an FFQ, standard portion sizes are often applied.
The absence of portion size questions in an FFQ can be regarded as a missing data problem. Using standard portion sizes is methodologically equivalent to applying median portion sizes for all subjects. These may be sex-specific, but the size of portions depends on several other factors than sex such as age, BMI and physical activity( Reference Noethlings, Hoffmann and Bergmann 4 ). Hence, the standard portion size used may well be the same for a young physically active man as it is for an elderly sedentary man.
Substituting unknown portion sizes with standard sizes may thus under- or overestimate the ‘true’ intake in certain segments of the population( Reference Greenland and Finkle 5 – Reference Rubin and Schenker 7 ). It is now well recognized that missing data are most rationally accounted for through multiple imputation techniques, rather than with deterministic imputations like medians, to avoid flawed (too narrow) confidence intervals( Reference Rubin 8 , Reference Sterne, White and Carlin 9 ). Multiple imputation requires an adequate method for imputation, i.e. a method with error and bias as low as possible.
In the present paper we describe how physiologically meaningful portion sizes can be estimated from information on age, sex, physical activity, weight and height by imputation from participants with complete data or from another FFQ data set with portion sizes (from a comparable population). We developed the ‘comparable categories’ (Coca) method and adapted the ‘k-nearest neighbours’ (KNN) and the multinomial logistic regression (MLR) methods to make them suitable for multiple imputation. The basic idea of these advanced imputation methods is that instead of using a median value for substituting missing data, one may condition on other information available in the data set to better estimate a reasonable portion size.
In the present study the dietary intake computed with standard portion sizes (the sex-specific median values), or with portion sizes determined by the MLR, Coca or KNN method, was compared with a reference dietary intake, which was computed with the originally self-reported portion sizes that were quantified by a photographic food atlas embedded in the FFQ.
Experimental methods
The Danish Health Examination Survey collected dietary data from 18 065 adult Danes in 2007–2008 using an Internet-based, 267-item FFQ( Reference Eriksen, Gronbaek and Helge 10 ). This diet inventory has been used in many Danish population studies( Reference Tjonneland, Haraldsdottir and Overvad 2 , 11 ). In the Danish Health Examination Survey, the FFQ was extended with a photographic food atlas consisting of eleven picture series placed at the end of the questionnaire in order to quantify the portion sizes( 11 ). The portion size food atlas was developed by the Danish Veterinary and Food Administration. The picture series covered thirty-nine items (foods or meals) classified into four or six portions of varying sizes. For instance, six photos showed increasing serving sizes of corn flakes in a bowl and the accompanying portion size item was used to quantify all cereal frequency items (muesli, etc.). Another series with six photos of increasing serving sizes of a meat main meal was accompanied by five portion size items covering hamburger steak, steak, beef, fish or poultry. The remaining series of photographs covered bread, toppings for rye bread (eight items), toppings for white bread (eight items), warm stew with meat (three items), potatoes (four items), pasta, rice, vegetable dishes (four items), mixed salad, chocolate and candy. The actual weight in grams of the food in the picture was multiplied by the frequency to obtain the total intake of the food. Leisure-time physical activity was self-reported with the International Physical Activity Questionnaire in four classes, where class 1 was hard training multiple times per week and class 4 was inactive behaviour( Reference Ekelund, Sepp and Brage 12 ). We defined classes 1+2 as active and classes 3+4 as sedentary. Anthropometric measures were obtained by clinical examination in 9384 subjects.
The present study population consisted of the 3728 subjects with complete information on anthropometry and portion sizes (no missing values). The characteristics of the study participants are described in Table 1. The involved institutions’ review boards have approved the study proposal.
Statistical methods
We analysed four methods of imputing portion size. The subjects were randomly divided (SAS procedure: proc surveyselect) into two data sets: (i) a learning data set A (n 1864) for generating data for imputation; and (ii) a test data set B (n 1864) for analysing the validity of the imputed data. For data set B the ‘mean daily total energy intake’ (TE) was computed with the complete set of authentic self-reported portion sizes and this TE served as the reference.
The population sex-specific medians were used as standard portion sizes. With each of the three stochastic imputation methods, we imputed portion sizes from data set A to data set B and used these estimated portion sizes to compute a new TE. This was done ten times (on different splits of the data) and subsequently ten TE values were computed with each imputation method.
The mean TE from each imputation method was then compared with the reference TE by determining the bias (defined as the mean error) and the root-mean-square error (RMSE). In the present paper the ‘error’ is defined as the reference value minus the estimated value. Spearman’s ρ was used to compare the ranking of the subjects, comparing the reference TE with the TE calculated with imputed portion sizes. T statistics were used to assess whether the bias in TE depended on the level of TE (Fig. 1). Energy and nutrient intakes were computed with FoodCalc® ( 13 ) and the Danish national food composition tables( 14 ).
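As a minimal sketch of the two accuracy measures defined above, the following Python snippet computes the bias (mean error) and the RMSE with the error taken as reference minus estimate. The function names and the example TE values are hypothetical; they are not taken from the study data or from the SAS code in the supplementary material.

```python
import math


def bias(reference, estimate):
    """Mean error, with error = reference - estimate (sign convention of the text)."""
    errors = [r - e for r, e in zip(reference, estimate)]
    return sum(errors) / len(errors)


def rmse(reference, estimate):
    """Root-mean-square error using the same error definition."""
    errors = [r - e for r, e in zip(reference, estimate)]
    return math.sqrt(sum(err ** 2 for err in errors) / len(errors))


# Hypothetical TE values in MJ/d for four subjects (illustration only).
reference_te = [9.0, 11.0, 8.0, 12.0]
imputed_te = [8.5, 10.0, 8.5, 11.0]

b = bias(reference_te, imputed_te)   # positive: the reference is underestimated
r = rmse(reference_te, imputed_te)
```

With this sign convention a positive bias means that the imputed TE is, on average, lower than the reference TE.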
The four imputation methods were:
1. The ‘median’ method or ‘standard portion sizes’. Imputation of median values is equivalent to applying a standard portion size as it implies uniform portion sizes for all subjects (here thirty-nine medians, one for each of the thirty-nine portion size items). In this model we used the sex-specific median values from the entire sample (from data sets A+B) to define thirty-nine sex-specific standard portion sizes in data set B (using the sex-specific median from data set A only would induce bias as explained in the online supplementary material, chapter 4). Based on earlier reports and physiological reasoning we hypothesized that portion sizes depend on age, sex, physical activity, weight and height( Reference Noethlings, Hoffmann and Bergmann 4 , Reference Clapp, McPherson and Reed 6 ). Individual data on these five variables are readily available in most epidemiological studies and they informed the following three, more advanced imputation methods that are all based on stochastic principles:
2. The ‘comparable categories’ (Coca) method. The subjects were divided into thirty-two categories. Supplemental Table S1 in the online supplementary material demonstrates how the categories were created by first dividing the subjects by level of physical activity (into active or sedentary), then dichotomized on approximate median values of height (166 cm), then divided by sex, split on rough median values of weight (74 kg) and age (48 years). Each of these categories contains individuals sharing approximately the same physiological characteristics, e.g. in category 13 everyone was sedentary, >166 cm, female, <74 kg and <48 years. For each subject in data set B, the portion sizes were substituted by a complete set of portion sizes from one random subject in the ‘comparable category’ in data set A.
3. The ‘k-nearest neighbours’ (KNN) method( Reference Parr, Hjartaker and Scheel 15 ). A missing portion size in data set B was substituted by a random value from the k (a predefined number) most similar observations (‘neighbours’) in data set A. The similarity is defined as the proximity measured by Euclidean distance between the informing variables (here age, sex, physical activity, weight and height). While traditional KNN would impute the portion size most prevalent among the k neighbours, our version of KNN imputed a random value among the k neighbours with probability proportional to the proximity, making it suitable for multiple imputation. k>20 yielded no extra accuracy.
4. The ‘multinomial logistic regression’ (MLR) method. MLR models were constructed based on data set A: age, weight and height were continuous covariates, sex and physical activity were categorical covariates, and the portion sizes were the categorical outcomes. Portion sizes in data set B were determined by probability sampling from the prevalence of the categorical portion size values obtained by inserting the data set B values for age, weight, height, sex and level of physical activity in the regression model.
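The Coca method (method 2 above) can be illustrated with a short Python sketch. It assumes that each subject is a dictionary with the five informing variables, that portion sizes are stored as a dictionary of portion size classes, and that boundary values fall in the upper half of each split; the field names, the boundary handling and the toy data are all hypothetical, and the sketch is not the authors' SAS implementation.

```python
import random
from collections import defaultdict

# Cut-offs quoted in the text: height 166 cm, weight 74 kg, age 48 years.
HEIGHT_CUT, WEIGHT_CUT, AGE_CUT = 166, 74, 48


def category(s):
    """Return the 5-part key identifying one of the 2**5 = 32 categories."""
    return (
        s["active"],
        s["height"] > HEIGHT_CUT,
        s["sex"],
        s["weight"] >= WEIGHT_CUT,   # boundary handling is an assumption
        s["age"] >= AGE_CUT,         # boundary handling is an assumption
    )


def coca_impute(learning, test, rng):
    """Give each test subject the complete portion set of one random donor
    drawn from the same category in the learning data."""
    donors = defaultdict(list)
    for s in learning:
        donors[category(s)].append(s["portions"])
    imputed = []
    for s in test:
        pool = donors.get(category(s))
        if not pool:
            raise ValueError("empty category: merge categories or adjust cut-offs")
        imputed.append(dict(rng.choice(pool)))
    return imputed


# Toy data (illustration only): portion values are portion size classes.
data_a = [
    {"active": True, "height": 180, "sex": "M", "weight": 80, "age": 30,
     "portions": {"pasta": 4, "rice": 3}},
    {"active": True, "height": 182, "sex": "M", "weight": 85, "age": 35,
     "portions": {"pasta": 5, "rice": 4}},
    {"active": False, "height": 170, "sex": "F", "weight": 60, "age": 40,
     "portions": {"pasta": 2, "rice": 2}},
]
data_b = [
    {"active": True, "height": 178, "sex": "M", "weight": 90, "age": 25},
    {"active": False, "height": 168, "sex": "F", "weight": 65, "age": 45},
]
result = coca_impute(data_a, data_b, random.Random(1))
```

Because the donor is drawn at random, repeating the call with different random seeds yields different imputed data sets, which is what makes the method usable for multiple imputation.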
The set-up was run in the SAS statistical software package version 9·2, but the methods can be applied on any type of software. SAS codes for KNN, MLR, Coca and a wrapper for (linear) regression analysis combining the results from multiple imputed (by any method) data sets are given in the online supplementary material.
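For readers who do not use SAS, the stochastic KNN draw (method 3 above) might be sketched in Python as follows. The text does not specify how ‘proximity’ is turned into a sampling probability or how the informing variables are scaled, so the sketch assumes weights of 1/(distance + eps) and feature vectors that have already been standardized; both are assumptions, as are the function name and the toy data.

```python
import math
import random


def knn_draw(target, learning, k, rng, eps=1e-9):
    """Draw one portion size from the k nearest neighbours of `target`.

    `learning` is a list of (features, portion_size) pairs. A neighbour is
    selected with probability proportional to 1 / (distance + eps), which is
    an assumed definition of 'proximity'.
    """
    ranked = sorted(learning, key=lambda pair: math.dist(target, pair[0]))[:k]
    weights = [1.0 / (math.dist(target, feats) + eps) for feats, _ in ranked]
    values = [portion for _, portion in ranked]
    return rng.choices(values, weights=weights, k=1)[0]


# Toy data: standardized (age, sex, activity, weight, height) and a portion class.
learning_a = [
    ((-1.0, 0.0, 1.0, 0.5, 0.8), 5),
    ((-0.9, 0.0, 1.0, 0.4, 0.7), 4),
    ((1.2, 1.0, 0.0, -0.6, -0.9), 2),
]
target_b = (-1.0, 0.0, 1.0, 0.5, 0.8)

nearest = knn_draw(target_b, learning_a, k=1, rng=random.Random(0))
draws = [knn_draw(target_b, learning_a, k=2, rng=random.Random(i)) for i in range(20)]
```

With k=2 the distant third subject can never be selected, whereas a traditional KNN would always return the single most prevalent value among the neighbours.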
Results
More women than men participated in the Danish Health Examination Survey. The subjects included in the present study were a little younger than the excluded subjects. Furthermore, the included men were more active and the included women were slightly heavier. However, differences were numerically small (Table 1).
Overall, compared with the reference energy intakes, the RMSE were equally low with the median and MLR methods, and equally high with Coca and KNN. The bias of the median method was numerically larger than in any of the other methods (Table 2). KNN had a negative bias in men (overestimating the portion sizes), but a positive bias in women (underestimating the portion sizes). The biases of MLR and Coca were equally low in both men and women.
RMSE, root-mean-square error; bias, mean error; median, sex-specific median imputation which is equivalent to using sex-specific standard portion sizes; Coca, ‘comparable categories’; KNN, k-nearest neighbours; MLR, multinomial logistic regression; Ref., referent category.
The four methods were compared by their ability to predict the reference. The reference energy intakes were computed with a set of complete reported portion sizes. The results presented are mean values of ten imputations with each method (on random splits of the data). Note that a positive bias indicates an underestimation of the reference and a negative bias indicates an overestimation.
More results are presented in the online supplementary material (Supplemental Table S2), including ‘non sex-specific’ standard portion sizes and different versions of Coca (with different informing variables and fewer categories). Results with selected micronutrients and macronutrient subtypes were essentially similar to the analyses of macronutrients (results not shown).
All of the methods had high Spearman’s rank correlation, but median and MLR imputation performed slightly better than KNN and Coca. All correlations were >0·90 and all confidence intervals between 0·89 and 0·97 (see online supplementary material, Supplemental Table S3).
Figure 1 illustrates how all methods resulted in a bias of TE dependent on TE, i.e. an underestimation of TE in subjects with a high energy intake and an overestimation of TE in subjects with a low energy intake. The magnitude of this bias (the T value) was markedly higher with median imputation than with the other methods. Figure 2 shows that when stratifying by BMI group, age group and physical activity class, a larger variation was seen among men than women regarding the accuracy of the imputation methods. The mean total energy intake was 12·5 MJ calculated with maximum portion sizes for all and 7·5 MJ with minimum portion sizes for all. Thus, up to 40 % of the calculated energy intake was potentially determined by the portion sizes. However, Fig. 2 indicates that the mean energy intakes calculated differed by up to 2 MJ (18 %) in men between the methods and by up to 0·75 MJ (9 %) in women.
Discussion
Overall, the MLR method provided the best agreement with the reference dietary intake. However, the differences between the stochastic methods were small and the confidence intervals of the bias in MLR and Coca were overlapping in most segments of the data. In MLR and Coca the bias did not differ substantially between men and women, whereas in KNN the bias was negative in men and positive in women. The median method (equivalent to sex-specific standard portion sizes) had relatively low RMSE but was inferior to the other methods in terms of bias. All of the methods underestimated the reference dietary intake, except KNN, which overestimated the portion sizes in men. The use of standard portion sizes systematically underestimated the energy intake of subjects with large portion sizes, a bias that diminished, for instance, differences in dietary intake between age groups. For example, a young man was assigned the same standard portion size as an elderly man even though we know that age is a determinant of energy intake as demonstrated in Fig. 2 and by the fact that age is an input variable in calculating the BMR( Reference Frankenfield, Roth-Yousey and Compher 16 ). This bias may well affect parameter estimates in multivariate analyses( Reference Eekhout, de Vet and Twisk 17 ). On the other hand, the median method performed better than the other methods in Spearman’s rank test. However, the confidence intervals were overlapping with MLR, and Coca and KNN also had high correlations with the reference energy intake.
Figure 2 demonstrates how all imputation methods were better in predicting portion sizes in women than in men. The greater variation in men is in part explained by the higher energy intake, but probably also by a greater variation in portion sizes in men.
Evaluation of the methods
We used ‘sex-specific median imputation’ as ‘standard portions’. Standard portions can of course be defined differently, but any deterministic portion size will contain the same sort of bias and the median sizes were probably a reasonable choice.
The simple Coca method worked surprisingly well and, compared with the other stochastic methods, the computer run time was much shorter. Depending on the size of the learning data set and the number of categories, empty or tiny categories may occur. This can be solved by adjusting the cut-off values in the dichotomization or by merging related categories. The relatively basic categorization can probably be altered to improve performance. More considerations about the different versions of the methods are presented in the online supplementary material.
External validity
The variables physical activity, sex, age, height and weight informed the three multiple imputation methods. Consequently, the three models had access to the same information. We also tested the methods including resting heart rate and ‘number of potatoes with warm meals’. By including the latter, all of the methods performed slightly better, and by including resting heart rate all of the methods performed slightly worse, but the methods performed approximately equally. The present five informing variables were chosen as they are readily available in most data sets.
The external validity of the methods may be questioned as the included subjects differed slightly from the excluded. However, the question is not whether the included and the excluded were comparable, but rather whether the relationship between physiology and portion sizes was different among the included and excluded, which does not seem very plausible.
Our reference or ‘gold standard’ was calculated from self-reported FFQ data with varying portion sizes and did not take into account information bias. It is well documented how self-reported values only to some degree reflect true intakes and that reporting of specific macronutrients may be differentially biased according to sex, weight and BMI( Reference Heitmann and Lissner 18 , Reference Fraser, Yan and Butler 19 ). All of the methods were affected by this reporting bias. Median and MLR are model-based and thereby the reporting error affected the model and had an overall effect on all imputations, i.e. possible over- and under-reporting will be spread out over the whole data. In contrast, Coca and KNN imputations are based on pairing similar individual observations and hence a systematic error will persist within the corresponding segments of the data.
Missing single values
Concerning FFQ with individual portion size questions, the MLR, Coca or KNN method can be used to substitute missing single values. In the Danish Health Examination Survey, from which the present data derive, 17·7 % of the questions on portion sizes were unanswered, which is not uncommon in an FFQ( Reference Subar, Kipnis and Troiano 20 ). Currently, most studies probably ‘fill in the blanks’ with median values or standard portions( Reference Eekhout, de Boer and Twisk 21 ). As demonstrated in the present study, median imputation generates bias. If only a few values are missing the resulting bias may be negligible, but the impact of median imputation bias increases with the number of missing values. If one of the stochastic methods is used for imputation of single missing values, a comparable data set is always at hand: the subset of data with no missing values. We have supplied Coca SAS codes for this use in the online supplementary material.
FFQ without portion sizes
MLR or Coca may be used to include portion sizes in FFQ without individual portion size questions. In this case the portion sizes will have to be imputed from a comparable data set with portion sizes. Often traditional FFQ have later been improved with portion size questions and if the populations are similar, data from newer semi-quantitative FFQ can be used as learning data set. We have supplied SAS codes for this use also in the online supplementary material.
Multiple imputation
When applying multiple imputation, the multivariate analyses are run on multiple (e.g. ten) data sets each with different imputed values. The resulting parameter estimates are then the mean values of the ten analyses( Reference Rubin and Schenker 7 ). In the present paper we did not test our imputation methods’ ability to predict parameter estimates, but solely the ability to predict the reference TE, using ten imputations for each method. The online supplementary material provides SAS codes on how to do multiple regression modelling with multiple data sets.
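The pooling step can be sketched in Python as follows. The pooled point estimate is the mean across the imputed data sets, as stated above; the pooled standard error follows the standard combining rules for multiple imputation (within-imputation variance plus (1 + 1/m) times the between-imputation variance), which the text does not spell out and which are added here as background. The function name and the example numbers are hypothetical, and this is not the SAS wrapper from the supplementary material.

```python
import math
from statistics import mean, variance


def pool(estimates, std_errors):
    """Combine one parameter estimate from m analyses of m imputed data sets.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Hypothetical regression coefficients and standard errors from three imputations.
coefs = [1.0, 1.2, 0.8]
ses = [0.1, 0.1, 0.1]
pooled_coef, pooled_se = pool(coefs, ses)
```

The between-imputation term is what widens the confidence interval relative to a single deterministic imputation such as the median.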
In summary
MLR and Coca are both valuable methods for including portion sizes in FFQ or substituting missing portion size values. The KNN method seemed less attractive due to the differential bias in men and women, and the relatively high RMSE. In general, these three stochastic methods allowed for estimation of meaningful portion sizes by conditioning on information about physiology and they were suitable for multiple imputation. Application of sex-specific standard portion sizes introduced more bias than the other methods tested and diminished differences in energy intake related to age, for instance. We propose to use the MLR or Coca method to substitute missing portion size values or when portion sizes need to be included in FFQ without portion size data.
Acknowledgements
Acknowledgements: The authors thank Jesper Lauritsen and the Danish Diet, Cancer and Health project for developing the freeware FoodCalc®. Financial support: The Danish Health Examination Survey (DANHES) was funded by the Ministry of the Interior and Health and the Tryg Foundation. The survey was carried out by the National Institute of Public Health, University of Southern Denmark. The present work was supported by the Danish PhD School of Molecular Metabolism, Region Southern Denmark, University of Southern Denmark; the Research Unit for General Practice in Copenhagen, Denmark; and the A.P. Møller Foundation for Advancement of Medical Science. The funders had no role in the design, analysis or writing of this article. Conflict of interest: None. Authorship: R.K.-R., V.S., T.I.H., N.d.F.O., J.E.H. and B.L.H. participated in formulating the research questions and in designing the study; B.L.H. provided the data; R.K.-R., V.S. and T.I.H. performed the statistical analyses; R.K.-R., V.S., T.I.H., N.d.F.O., J.E.H. and B.L.H. analysed the results and contributed to the writing and editing of the manuscript draft; R.K.-R. wrote the manuscript. All authors read and approved the final manuscript. Ethics of human subject participation: The DANHES study was approved by the Danish local ethics committees and the Danish Data Protection Agency. The involved institutions’ review boards have approved the present study proposal.
Supplementary material
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S1368980014002389