1 Why care about hyperparameters?
When political scientists work with machine learning models, they want to find a model that generalizes well from training data to new, unseen data.[Footnote 1] Hyperparameters play a key role in this endeavor because they determine the models’ capacity to generalize. Finding a good set of hyperparameters critically affects conclusions about a model's performance. The failure to correctly tune and report hyperparameters has recently been identified as a key impediment to the accumulation of knowledge in computer science (e.g., Henderson et al., 2018; Melis et al., 2018; Bouthillier et al., 2019, 2021; Cooper et al., 2021; Gundersen et al., 2023). Is political science making the same mistake?
We examined 64 machine learning-related papers published between 1 January 2016 and 20 October 2021 in some of the top journals of our discipline: the American Political Science Review (APSR), Political Analysis (PA), and Political Science Research and Methods (PSRM). Of the 64 publications we analyzed, 36 (56.25 percent) do not report the values of their hyperparameters in either the paper or the appendix. Forty-nine publications (76.56 percent) do not share information about how they used tuning to find the values of their hyperparameters. Only 13 publications (20.31 percent) offer a complete account of the hyperparameters and their tuning. Not being transparent is a dangerous habit because, when this information is missing, readers and reviewers cannot assess the quality of a manuscript without access to the replication code.
With this paper, therefore, we raise awareness that hyperparameters and their tuning matter. In statistical inference, the goal is to estimate the value of an unknowable population parameter. Including robustness checks in a paper and its appendix is good practice, allowing others to understand critical choices in research design and statistical modeling. The actual out-of-sample performance of a machine learning model is such an unknown quantity, too. We suggest handling estimates of population parameters and hyperparameters in machine learning models with the same loving care.
First, we explain what hyperparameters are and why they are essential. Second, we show why it is dangerous not to be transparent about hyperparameters. Third, we offer best practice advice about properly selecting hyperparameters. Finally, we illustrate our points by comparing the performance of several machine learning models to predict electoral violence from tweets (Muchlinski et al., 2021).
2 What are hyperparameters and why do they need to be tuned?
Many machine learning models have parameters and also hyperparameters. Model parameters are learned during training, and hyperparameters are typically set before training. Hyperparameters determine how and what a model can learn and how well the model will perform on out-of-sample data. Hyperparameters are thus situated at a meta-level above the models themselves.
Consider the following stylized example displayed in Figure 1.[Footnote 2] A linear regression approach could model the relationship between X and Y as $\hat{Y} = \beta_0 + \beta_1 X$. A more flexible model would include additional polynomial terms in X, up to some degree λ. For example, choosing λ = 2 encodes the theoretical belief that Y is best predicted by a quadratic function of X, i.e., $\hat{Y} = \beta_0 + \beta_1 X + \beta_2 X^2$. But it is also possible to rely on the data alone to find the optimal value of λ. Measuring the generalization error with a metric like the mean squared error helps empirically select the most promising value of λ.
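The empirical selection of λ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated from a quadratic relationship (they are not the data behind Figure 1), the 140/60 split is arbitrary, and the helper names `fit_polynomial` and `mse` are made up for this example.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, ..., b_degree] from the normal equations."""
    n = degree + 1
    # Augmented matrix [X'X | X'y] for the design matrix with columns x^0 ... x^degree.
    aug = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mse(coefs, xs, ys):
    """Mean squared error of the fitted polynomial on the given data."""
    preds = [sum(b * x ** k for k, b in enumerate(coefs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Simulated data (an assumption of this sketch): a quadratic signal plus noise.
rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(200)]
ys = [1.0 + 0.5 * x - 1.5 * x ** 2 + rng.gauss(0, 0.5) for x in xs]
train_x, train_y = xs[:140], ys[:140]
val_x, val_y = xs[140:], ys[140:]

# The parameters (coefficients) are learned on the training data for each
# candidate value of the hyperparameter lambda (the polynomial degree).
val_errors = {}
for lam in range(1, 7):
    coefs = fit_polynomial(train_x, train_y, lam)
    val_errors[lam] = mse(coefs, val_x, val_y)

# The hyperparameter is chosen by comparing held-out errors.
best_lambda = min(val_errors, key=val_errors.get)
print(val_errors, best_lambda)
```

The coefficients are estimated separately for every candidate degree, while the degree itself is selected only by comparing errors on data the fit has not seen.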
This polynomial regression comes with both parameters and hyperparameters. Parameters are variables that belong to the model itself; in our example, these are the coefficients of the regression equation. Hyperparameters are those variables that help specify the exact model. In the context of the polynomial regression, λ is the hyperparameter that determines how many parameters will be learned (Goodfellow et al., 2016). Machine learning models can, of course, come with many more hyperparameters that relate not only to the exact parameterization of the machine learning model. Anything that is part of the function that maps the data to a performance measure and that can be set to different values can be considered a hyperparameter, e.g., the choice and settings of a kernel in a support vector machine (SVM), the number of trees in a random forest (RF), or the choice of a particular optimization algorithm.
3 Misselecting hyperparameters
Research on machine learning has recently identified several problems that may arise from handling hyperparameters without care. The failure to report the chosen hyperparameters impedes scientific progress (Henderson et al., 2018; Bouthillier et al., 2019, 2021; Gundersen et al., 2023). In the face of a hyperparameter space marked by the curse of dimensionality, other researchers can only replicate published work if they know the hyperparameters used in the original study (Sculley et al., 2018). In addition, it is essential to tune the hyperparameters of all models, including baseline models. Without such tuning, it is impossible to compare the performance of two different models $M_a$ and $M_b$: while some may find that $M_a$ performs better than $M_b$, others replicating the study with different hyperparameter settings could conclude the opposite, namely that $M_a$ does not perform better than $M_b$. Such “hyperparameter deception” (Cooper et al., 2021) has confused scientific progress in various subfields in computer science where machine learning plays a key role, including natural language processing (Melis et al., 2018), computer vision (Musgrave et al., 2020), and generative models (Lucic et al., 2018). Reviewers and readers need to comprehend the hyperparameter tuning to assess whether a new model reliably performs better or whether a study merely tests new hyperparameters (Cooper et al., 2021).
It is good to see political scientists also discuss and stress the relevance of hyperparameter tuning in their work (e.g., Cranmer and Desmarais, 2017; Fariss and Jones, 2018; Chang and Masterson, 2020; Miller et al., 2020; Rheault and Cochrane, 2020; Torres and Francisco, 2021). But does the broader political science community fulfill the requirements suggested in the computer science literature? To understand how hyperparameters are used in the discipline, we searched for the term “machine learning” in all papers published in APSR, PA, and PSRM after 1 January 2016 and before 20 October 2021. If a paper applies a machine learning model with tunable hyperparameters, we first annotate whether the authors report the final values of hyperparameters for all models in their paper or its appendix.[Footnote 3] We also record whether authors transparently describe how they tuned hyperparameters.[Footnote 4] Table 1 summarizes the findings from our annotations. We find that 34 publications (53.12 percent) report neither the values of the final hyperparameters nor the tuning regime in the publication or its appendix. Another 15 publications (23.44 percent) offer information about the final hyperparameter values but not how they tuned the machine learning models. In two cases (3.12 percent), we find information about the tuning regime but not about the final values of the hyperparameters. Finally, only 13 publications (20.31 percent) offer a full account of both the final choice of the hyperparameters and the way the tuning occurred in either the paper itself or its appendix.
Note that we annotated the literature in a way that helps understand whether reviewers and readers can assess the robustness of the analyses based on the manuscript and its appendix. Our analysis does not consider the replication code since it is typically not considered in the review process. In addition, we do not make any judgments about correctness. A paper without information about hyperparameter values or their tuning can still be correct. Similarly, a paper that reports hyperparameter values and a complete account of the tuning can still be wrong. It is the task of reviewers to evaluate the quality of a manuscript. But without a complete account of hyperparameter values and tuning, readers and, in particular, reviewers cannot judge whether hyperparameter tuning is technically sound.
4 Best practice
Hyperparameters are a fundamental element of machine learning models. Documenting their careful selection helps build trust in the insights gained from machine learning models.
4.1 Selecting hyperparameters for performance tuning
Without automated procedures for finding hyperparameters, researchers need to rely on heuristics (Probst et al., 2019). The classic approach to hyperparameter optimization is to systematically try different hyperparameter settings and compare the resulting models using a performance measure. To do so, the data are split into training, validation, and test data (Friedman et al., 2001; Goodfellow et al., 2016). The model parameters are optimized using the training data. The validation data are used to optimize the hyperparameters by estimating and then comparing the performance of all the different models. Finally, the test data help approximate the performance of the best model on out-of-sample data. For a realistic estimate of the model's performance, researchers should train a final machine learning model. This model relies upon the identified best set of hyperparameters, uses a combined set of the training and validation data, and is evaluated on the so far withheld test set. Note that this last evaluation should be done only once to avoid information leakage. Tuning hyperparameters is therefore not a form of “p-hacking” (Wasserstein and Lazar, 2016; Gigerenzer, 2018) where researchers try different models until they find the one that generates the desired statistics. On the contrary, transparently testing different hyperparameter values is necessary to find a model that generalizes well.
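The following Python sketch walks through this protocol from split to final evaluation. It rests on several assumptions made purely for illustration: the data are simulated, the model is a one-feature ridge regression without intercept whose penalty `alpha` is the only hyperparameter, and the 60/20/20 split and the candidate values are arbitrary choices.

```python
import random


def fit_ridge(xs, ys, alpha):
    """One-feature ridge regression without intercept; returns the learned slope."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(slope, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Simulated data (an assumption of this sketch).
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(300)]
ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]

# 60/20/20 split into training, validation, and test data.
train_x, train_y = xs[:180], ys[:180]
val_x, val_y = xs[180:240], ys[180:240]
test_x, test_y = xs[240:], ys[240:]

candidate_alphas = [0.0, 0.1, 1.0, 10.0, 100.0]

# Parameters are fit on the training data; hyperparameter candidates are
# compared on the validation data.
val_errors = {
    alpha: mse(fit_ridge(train_x, train_y, alpha), val_x, val_y)
    for alpha in candidate_alphas
}
best_alpha = min(val_errors, key=val_errors.get)

# Final model: best hyperparameter, refit on training and validation data combined.
final_slope = fit_ridge(train_x + val_x, train_y + val_y, best_alpha)

# The test set is used exactly once, after all tuning decisions have been made.
test_mse = mse(final_slope, test_x, test_y)
print(best_alpha, test_mse)
```

Because `test_x` and `test_y` play no role in choosing `best_alpha`, the final number is not contaminated by the tuning itself.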
In hyperparameter grid search, researchers manually define a grid of hyperparameter values, then try each possible combination and record the validation performance for each set of hyperparameters. More recently, some instead suggest randomly sampling a large number of hyperparameter candidate values from a pre-defined search space (Bergstra and Bengio, 2012) and recording the validation performance of each set of sampled hyperparameter values.[Footnote 5] This random search can help explore the space of hyperparameters more efficiently if some hyperparameters are more important than others. Both approaches typically yield reliable and good results for practitioners and build trust regarding the out-of-sample performance.
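To make the difference between the two strategies concrete, the sketch below tunes a small distance-weighted k-nearest-neighbour regression with two hyperparameters, the number of neighbours `k` and the distance-weighting exponent `power`. The model, the simulated data, the grid values, and the sampling ranges are all assumptions chosen for illustration.

```python
import itertools
import math
import random


def knn_predict(train, x, k, power):
    """Distance-weighted k-nearest-neighbour prediction for a single input x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    weights = [1.0 / (abs(px - x) + 1e-9) ** power for px, _ in nearest]
    return sum(w * py for w, (_, py) in zip(weights, nearest)) / sum(weights)


def validation_mse(train, val, k, power):
    """Mean squared error on the validation data for one hyperparameter setting."""
    return sum((knn_predict(train, x, k, power) - y) ** 2 for x, y in val) / len(val)


# Simulated data (an assumption of this sketch).
rng = random.Random(1)
points = []
for _ in range(200):
    x = rng.uniform(-2, 2)
    points.append((x, math.sin(3 * x) + rng.gauss(0, 0.3)))
train, val = points[:150], points[150:]

# Grid search: evaluate every combination of the manually chosen values.
grid = {"k": [1, 5, 15, 30], "power": [0.0, 1.0, 2.0]}
grid_results = {
    (k, power): validation_mse(train, val, k, power)
    for k, power in itertools.product(grid["k"], grid["power"])
}
best_grid = min(grid_results, key=grid_results.get)

# Random search: draw the same number of settings from the search space.
n_draws = 12
random_results = []
for _ in range(n_draws):
    k = rng.randint(1, 30)         # integer, both endpoints included
    power = rng.uniform(0.0, 2.0)  # continuous
    random_results.append(((k, power), validation_mse(train, val, k, power)))
best_random = min(random_results, key=lambda item: item[1])

print(best_grid, grid_results[best_grid])
print(best_random)
```

With the same budget of twelve evaluations, the grid visits only four distinct values of `k` and three of `power`, whereas random search can try up to twelve distinct values of each.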
But the tuning of hyperparameters might be too involved for grid or random search in light of resource constraints. It is then useful not to try all combinations of hyperparameters but rather to focus on the most promising ones.[Footnote 6] Sequential model-based Bayesian optimization formalizes such a search for a new candidate set of hyperparameters (Snoek et al., 2012; Shahriari et al., 2016). The core idea is to formulate a surrogate model (think of a non-linear regression model) that predicts the machine learning model's performance for a set of hyperparameters. At iteration t, the underlying machine learning model is trained with the surrogate model's suggestion for the next best candidate set of hyperparameters. The results from this training at t are fed back into the surrogate model and used to refine the predictions for the candidate set of hyperparameters in the next iteration t + 1.[Footnote 7]
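A stripped-down version of this loop is sketched below for a single hyperparameter. It is a toy under loudly stated assumptions, not a production optimizer: the surrogate is a Gaussian process with a fixed squared-exponential kernel (unit amplitude, length scale 0.5), the next candidate is chosen with a lower-confidence-bound rule rather than expected improvement, and `validation_error` is a made-up stand-in for the expensive step of training a model and scoring it on validation data.

```python
import math


def rbf(a, b, length=0.5):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def surrogate(xs, ys, x_new, jitter=1e-4):
    """Posterior mean and standard deviation of the GP surrogate at x_new."""
    centre = sum(ys) / len(ys)
    k_mat = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
    k_vec = [rbf(a, x_new) for a in xs]
    weights = solve(k_mat, k_vec)  # K^{-1} k(x_new); valid because K is symmetric
    mean = centre + sum(w * (y - centre) for w, y in zip(weights, ys))
    var = max(rbf(x_new, x_new) - sum(w * k for w, k in zip(weights, k_vec)), 0.0)
    return mean, math.sqrt(var)


def lower_confidence_bound(x, xs, ys, kappa=2.0):
    """Low values mean low predicted error and/or high uncertainty."""
    mean, sd = surrogate(xs, ys, x)
    return mean - kappa * sd


def validation_error(log_alpha):
    """Hypothetical stand-in for 'train the model and score it on validation data'."""
    return (log_alpha - 0.7) ** 2 + 0.1 * math.sin(5 * log_alpha)


candidates = [-2.0 + 0.04 * i for i in range(101)]  # search space, e.g. log10(alpha)
tried = [candidates[0], candidates[50], candidates[100]]  # initial design
scores = [validation_error(x) for x in tried]

for t in range(8):
    # Ask the surrogate for the most promising untried candidate ...
    next_x = min((c for c in candidates if c not in tried),
                 key=lambda c: lower_confidence_bound(c, tried, scores))
    # ... evaluate it, and feed the result back for iteration t + 1.
    tried.append(next_x)
    scores.append(validation_error(next_x))

best_score, best_log_alpha = min(zip(scores, tried))
print(best_log_alpha, best_score)
```

Only eleven of the 101 candidates are ever evaluated, which is the point of the approach when each evaluation means training a costly model.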
Without a formal solution, the selection of hyperparameters requires human judgment. We suggest relying on the following short heuristics when tuning and communicating hyperparameters.[Footnote 8]
1. Understanding the model. What are the available hyperparameters? How do they affect the model?
2. Choosing a performance measure. What is a good performance for the machine learning model? Depending on the respective task, appropriate measures help assess the model's success. For example, a regression model is trained to minimize the mean squared error. Classification models can be trained to maximize the F1 score. With an appropriate performance measure, it is also possible to systematically tune the hyperparameters of unsupervised models (Fan et al., 2020).
3. Defining a sensible search space. Useful starting points for the hyperparameters can be the default values in software libraries, recommendations from the literature, or one's own previous experience (Probst et al., 2019). Any choice may also be informed by considerations about the data-generating process. If the hyperparameters are numerical, there may be a difference between mathematically possible and reasonable values.
4. Finding the best combination in the search space. In grid search, researchers should try every possible combination of the hyperparameters of the search space to find the optimal combination. In random search, each run picks a different random set of hyperparameters from the search space.
5. Tuning under strong resource constraints. If the model training is too involved, adaptive approaches such as sequential model-based Bayesian optimization allow for efficiently identifying and testing promising hyperparameter candidates.
Researchers should describe in either the main body or the appendix of their publication how they tuned their hyperparameters and also what final values they chose. Only then can reviewers and readers assess the robustness of machine learning models.
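One lightweight way to meet this requirement is to write the tuning details to a small machine-readable record that can be pasted into an appendix. The sketch below shows one possible layout; the field names and every value in it are hypothetical placeholders, not results or recommendations from this paper.

```python
import json

# All entries are hypothetical placeholders that illustrate the kind of
# information a reader needs; they do not describe any real analysis.
tuning_report = {
    "model": "random forest",
    "performance_measure": "F1 score on validation data",
    "search_method": "random search",
    "search_space": {
        "n_trees": {"type": "integer", "low": 100, "high": 1000},
        "max_depth": {"type": "integer", "low": 2, "high": 30},
    },
    "number_of_configurations_tried": 50,
    "data_split": {"training": 0.6, "validation": 0.2, "test": 0.2},
    "random_seed": 2023,
    "final_hyperparameters": {"n_trees": 500, "max_depth": 12},
}

# Minimal completeness check before the record goes into the appendix.
required_fields = [
    "model",
    "performance_measure",
    "search_method",
    "search_space",
    "final_hyperparameters",
]
missing_fields = [field for field in required_fields if field not in tuning_report]

report_text = json.dumps(tuning_report, indent=2, sort_keys=True)
print(report_text)
```

A record like this covers both elements asked for above: how the hyperparameters were tuned (measure, method, search space, budget) and which final values were chosen.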
4.2 Illustration: Comparing machine learning models to predict electoral violence from tweets
To illustrate our point, we compare machine learning models trained to predict electoral violence from tweets. Muchlinski et al. (2021) collected tweets around elections in three countries (Ghana, the Philippines, and Venezuela) and annotated whether these messages described occurrences of electoral violence. We re-scraped the data based on the shared tweet IDs. To predict these occurrences from the content of these tweets, we use four different machine learning models: a naive Bayes classifier (NB), a random forest (RF), a support vector machine (SVM), and a convolutional neural network (CNN).
Table 2 summarizes our results. In the left column of each country, we report the results from training the models with default hyperparameters. On the right, we show the results after hyperparameter tuning.[Footnote 9] Hyperparameter tuning improves the out-of-sample performance for most machine learning models in our experiment.[Footnote 10] Table 2 also shows how easy it is to be deceived about the relative performance of different models if hyperparameters are not properly tuned. The performance gains from tuning are so substantial that most tuned models outperform any other model with default hyperparameters. In the case of Venezuela, for example, comparing a tuned model with all other baseline models at their default hyperparameter settings could lead to different conclusions. Researchers could mistakenly conclude that (a tuned) NB classifier (F1 = 0.308) is on par with a CNN model (F1 = 0.319) and better than any other method; or that the RF is the better model (F1 = 0.479), or the SVM (F1 = 0.465), or the CNN (F1 = 0.304). In short, model comparisons and model choices are only meaningful if all hyperparameters of all models are systematically tuned and if this tuning is transparently documented.
Note to Table 2. On the left: results with default values for the hyperparameters. On the right: results from tuned hyperparameters.
5 Tuning hyperparameters matters
Hyperparameters critically influence how well machine learning models perform on unseen, out-of-sample data. Despite the relevance of tuned hyperparameters, we found that only 20.31 percent of the papers using machine learning models published in APSR, PA, and PSRM between 2016 and 2021 include information about the final hyperparameter values and how they were found in the manuscript or the appendix. Furthermore, 34 papers (53.12 percent) report neither the hyperparameters nor their tuning. This is a dangerous habit since handling hyperparameters without care can lead to wrong conclusions about model performance and model choice.
The search for an optimal set of hyperparameters is a vibrant research area in computer science and statistics. For most of the applications in our discipline, acknowledging and discussing how the choice of hyperparameters could influence results in combination with a proper and systematic search for appropriate hyperparameters would go a long way. It would allow others to understand original work, assess its validity, and thus ultimately help build trust in political science that uses machine learning.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2023.61. Replication material for this article is available at https://doi.org/10.7910/DVN/HLJW1Q.
Acknowledgements
Thomas Gschwend, Oliver Rittmann, and Zach Warner provided very insightful feedback on earlier versions of the draft. We also thank three anonymous reviewers and the editor for constructive feedback in improving the quality of the manuscript.