Bowers et al. show that, in the domain of visual perception, recent deep neural network (DNN) models that have excellent predictive performance on some types of tasks, for example, object recognition, differ from human vision in inarguable ways, for example, being biased toward making predictions based on texture rather than shape. We agree that deep-learning networks are fundamentally limited as scientific models of vision.
Generalizing Bowers et al.'s excellent observations to domains of behavior research other than vision, we suggest that throwing big models at big datasets suffers from fundamental limitations when studying scientific phenomena with low retest or inter-rater reliability (Sifar & Srivastava, Reference Sifar and Srivastava2021). In particular, large parametric models, of which supervised machine-learning models constitute an important subset, presuppose a deterministic mathematical relationship between stimuli and labels; that is, when seeing features X, the model will emit a response y. When y is stochastic, and large models are trained on one possible instance of {X, y} observations, model predictions can end up being too good to be true, in the sense that they offer statistically good predictions for phenomena that are, based on the features seen, actually unpredictable (Fudenberg, Kleinberg, Liang, & Mullainathan, Reference Fudenberg, Kleinberg, Liang and Mullainathan2019; Sifar & Srivastava, Reference Sifar and Srivastava2022).
Recent work has begun to quantify the notion of models being too good to be true. Fudenberg et al. (Reference Fudenberg, Kleinberg, Liang and Mullainathan2019) define completeness as the ratio of a model's error reduction from a naive baseline to the error reduction achieved by the best possible model from the same baseline, with the best possible model defined as the table of {X, y} mappings available in the training dataset. Since unreliable behavior intrinsically implies that the same X can correspond to more than one y, and since a model can predict only one of these values or an average of them, there will be some degree of irreducible error in even the best possible model.
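To make the completeness measure concrete, the sketch below computes it for a toy dataset in which identical stimuli elicit different responses; the function names, toy data, and use of squared error are our own illustrative assumptions, not code from Fudenberg et al. (Reference Fudenberg, Kleinberg, Liang and Mullainathan2019).

```python
import numpy as np

def completeness(y_true, y_model, y_baseline, y_table):
    """Completeness in the sense of Fudenberg et al. (2019): the model's
    error reduction from a naive baseline, as a fraction of the error
    reduction achieved by the best possible (lookup-table) model."""
    baseline_err = np.mean((y_true - y_baseline) ** 2)
    model_err = np.mean((y_true - y_model) ** 2)
    table_err = np.mean((y_true - y_table) ** 2)  # irreducible when y is stochastic given X
    return (baseline_err - model_err) / (baseline_err - table_err)

# Toy data: identical stimuli X can elicit different responses y, so even
# the table model (per-stimulus mean response) retains some error.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=1000)                 # 10 distinct stimuli
y = X + rng.normal(0.0, 1.0, size=1000)            # stochastic responses
y_baseline = np.full_like(y, y.mean())             # naive baseline: grand mean
y_table = np.array([y[X == x].mean() for x in X])  # best possible model
y_model = 0.9 * X + 0.5                            # some fitted parametric model
print(round(completeness(y, y_model, y_baseline, y_table), 3))
```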
Similarly, Sifar and Srivastava (Reference Sifar and Srivastava2022) measured the retest reliability of economic preferences for risky choice using the classic “decisions from description” paradigm. They note that a basic statistical identity, $\rho_{m,s2} \le \rho_{s1,s2}\,\rho_{m,s1} + \sqrt{(1 - \rho_{s1,s2}^2)(1 - \rho_{m,s1}^2)}$, limits the consistency of a model m with data observed in two sessions s1 and s2. This relationship is graphically illustrated in Figure 1, which shows that for low retest reliability, extremely high correlations between the model and one session's data are guaranteed to produce much lower correlations between that model and the other session's data, even if both sessions use the same target stimuli and protocol. Thus, the model with the best predictive accuracy when trained with one session's data is guaranteed to perform worse when tested on data collected in another session from the same participants for the same problems. Based on the measured retest reliability of economic choices, Sifar and Srivastava (Reference Sifar and Srivastava2022) suggest that models showing a correlation greater than 0.85 with any given dataset may not truly be capturing important psychological phenomena about risky choices, but rather may simply be overfit to dataset characteristics. Interestingly, this seems to suggest that simple generalized utility models like prospect theory are already “good enough” models of risky choice, a conclusion also reached independently by Fudenberg et al. (Reference Fudenberg, Kleinberg, Liang and Mullainathan2019).
Figure 1. All points below the x = y line on each of the curves indicate a situation where model m is guaranteed to perform worse in predicting s2 data when fitted to s1 data.
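A minimal numerical sketch of this bound follows; the function name and the assumed retest reliability of 0.7 are illustrative choices of ours, not values reported by Sifar and Srivastava (Reference Sifar and Srivastava2022).

```python
import numpy as np

def max_rho_m_s2(rho_s1_s2, rho_m_s1):
    """Upper bound on the model-to-session-2 correlation implied by
    rho_{m,s2} <= rho_{s1,s2} * rho_{m,s1}
                  + sqrt((1 - rho_{s1,s2}^2) * (1 - rho_{m,s1}^2))."""
    return rho_s1_s2 * rho_m_s1 + np.sqrt((1 - rho_s1_s2**2) * (1 - rho_m_s1**2))

# Assume a retest reliability of 0.7 between sessions s1 and s2 (illustrative).
# The better the fit to session 1, the lower the guaranteed ceiling for session 2.
for rho_m_s1 in (0.85, 0.95, 0.99):
    bound = max_rho_m_s2(rho_s1_s2=0.7, rho_m_s1=rho_m_s1)
    side = "below" if bound < rho_m_s1 else "above"
    print(f"rho_m_s1 = {rho_m_s1:.2f} -> rho_m_s2 <= {bound:.3f} ({side} the x = y line)")
```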
Notably, the prediction error observed in retest observations of a phenomenon cannot be controlled by increasing either model size or dataset size, as is prominently being recommended these days (Agrawal, Peterson, & Griffiths, Reference Agrawal, Peterson and Griffiths2020; Peterson, Bourgin, Agrawal, Reichman, & Griffiths, Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021; Yarkoni & Westfall, Reference Yarkoni and Westfall2017). It can only be reduced by adding more features to datasets, that is, by measuring and characterizing more sources of variability (Sifar & Srivastava, Reference Sifar and Srivastava2022). Thus, limits to predictability based on data unreliability imply that statistical model selection breaks down beyond a point for even the largest models and datasets; once multiple models can fit the data well enough, considerations other than goodness-of-fit must differentiate them.
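The following toy simulation, under an additive Gaussian response-noise model assumed by us for illustration (not taken from the cited papers), shows why collecting more data of the same kind leaves the retest error floor untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

def retest_error_floor(n_items, noise_sd=1.0):
    """Responses are a latent value plus session-specific noise. A model that
    reproduces session-1 responses exactly still misses session-2 responses by
    about 2 * noise_sd**2 in mean squared error, however large n_items gets."""
    latent = rng.normal(0.0, 1.0, n_items)
    s1 = latent + rng.normal(0.0, noise_sd, n_items)  # session 1 responses
    s2 = latent + rng.normal(0.0, noise_sd, n_items)  # retest responses
    model = s1                                        # "perfect" fit to session 1
    return np.mean((model - s2) ** 2)

for n in (100, 10_000, 1_000_000):
    print(n, round(retest_error_floor(n), 3))         # hovers near 2.0 as n grows
```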
In many important domains of behavior, small theory-driven models already offer predictions close to test reliability or inter-rater agreement levels in terms of accuracy (Fudenberg et al., Reference Fudenberg, Kleinberg, Liang and Mullainathan2019; Martin, Hofman, Sharma, Anderson, & Watts, Reference Martin, Hofman, Sharma, Anderson and Watts2016). For instance, while prospect theory is already close to an ideal model in terms of error reduction, as shown by Fudenberg et al. (Reference Fudenberg, Kleinberg, Liang and Mullainathan2019), large models fit to large datasets statistically claim massive reductions in error beyond what an ideal model would be capable of, using the same impoverished feature sets that prospect theory uses (Bhatia & He, Reference Bhatia and He2021; Peterson et al., Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021). The theoretical claims of such large models, however, simply offer minor modifications to the shape of the utility function used in prospect theory (Peterson et al., Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021). We argue that, in contrast to statistical predictability, scientific understanding cannot be advanced simply by fitting bigger models to bigger datasets; advancing it requires fitting better models to better datasets by identifying new features that uncover additional sources of principled variation in the data.
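To illustrate the scale of model being discussed, the sketch below implements commonly used parametric forms of prospect theory's value and probability-weighting functions (in the style of Tversky and Kahneman's 1992 formulation); the single shared weighting function and the specific parameter values are simplifications adopted here purely for illustration, not a new fit or a claim about any cited dataset.

```python
import numpy as np

# Commonly cited parameter estimates for the 1992 parametric forms; using one
# weighting function for gains and losses is a simplification for illustration.
ALPHA, LAMBDA, GAMMA = 0.88, 2.25, 0.61

def value(x):
    """Value function: concave for gains, convex and steeper for losses."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, np.abs(x) ** ALPHA, -LAMBDA * np.abs(x) ** ALPHA)

def weight(p):
    """Inverse-S-shaped probability weighting function."""
    p = np.asarray(p, dtype=float)
    return p ** GAMMA / (p ** GAMMA + (1 - p) ** GAMMA) ** (1 / GAMMA)

def prospect_value(outcomes, probs):
    """Subjective value of a simple gamble: weighted sum of outcome values."""
    return float(np.sum(weight(probs) * value(outcomes)))

# A 50/50 gamble over +100 and -100 is valued negatively (loss aversion).
print(round(prospect_value([100, -100], [0.5, 0.5]), 2))
```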
In summary, we agree with Bowers et al. that deep-learning models, while excellent in predictive terms, may not offer unalloyedly deep insight into scientific phenomena, a trait we propose they share with other large statistical models with weak theoretical commitments that are endemic in many studies of behavior (Cichy & Kaiser, Reference Cichy and Kaiser2019). While the ability to search more complex function classes rather than simpler ones for models of behavior is an attractive proposition recently made possible by advances in machine learning, it is important to remain aware that fitting highly flexible models to large amounts of data, with each datum generated as a restricted sample from a highly variable phenomenon, runs the risk of producing high-accuracy models of the dataset rather than of the underlying scientific phenomenon of interest. Respecting fundamental limits to the predictability of cognitive behavior must necessarily foreground mechanistic plausibility, conceptual parsimony, and consilience as criteria beyond empirical risk minimization for differentiating theoretical models.
Financial support
This work was not supported by any funding organization.
Competing interest
None.