A. Introduction
We knew it all along, didn’t we? A hungry judge is a stricter one. In 2011, a now famous study claimed to prove the point.Footnote 1 Judges were observed to issue harsher decisions just before their lunch break. The study demonstrated what legal realists had always been warning us about: judicial decision-making is not exempt from the pitfalls of human thinking.Footnote 2 Today the “hungry judge” effect has entered common wisdom.Footnote 3 Many commentators rely on it to argue for far-reaching consequences. Since the rise of technologies that are commonly summed up under the label “artificial intelligence” (AI),Footnote 4 machine-based decision aids appear as obvious remedies to human bias. After all, machines do not feel hunger or fatigue. The evidence for the fallibility of the “hungry judge,” however, is not nearly as conclusive as commonly assumed. On the contrary, there are indications that the correlation found in the original study was spurious. Thus, reliance on the study to motivate legal interventions is problematic. First, it leads to a misconstruction of the normative problem. As automated decision-aids potentially introduce new biases to decision-making procedures, their employment requires a careful appraisal of risks and benefits. The “hungry judge” argument skews this analysis. Second, presenting the justice system as worse than it really is makes for a dangerous argumentative strategy. Trust in legal institutions is a precious societal resource, which should not be undermined without reliable evidence.Footnote 5
This article consists of three parts. The first part introduces the “hungry judge” effect and describes its place in the legal literature, where it is broadly and uncritically used as an argument in the debate on the benefits and risks associated with the use of AI.Footnote 6 The second part questions this uncritical reliance.Footnote 7 In particular, it takes a close look at the “hungry judge” study and the arguments against its validity.Footnote 8 The last part reflects on our use of social science evidence in legal discourse.Footnote 9 It argues for a careful approach and warns against the lure of narratives and the dangers of “problem gerrymandering.”
B. The “Hungry Judge” Effect and the Argument for Artificial Intelligence
The “hungry judge” effect is famous. It is the main result of an observational study by researchers Shai Danziger, Jonathan Levav, and Liora Avnaim-Pesso on the decisions of a real-life panel of Israeli judges ruling on the early release of prisoners on parole.Footnote 10 According to the study, the likelihood that an application for early release was rejected increased as the breakfast or lunch break approached. The explanation is intuitive: fatigue or hunger influences our decision-making. The study was published prominently in the “Proceedings of the National Academy of Sciences” (PNAS) in 2011. In the same year, the famous psychologist and Nobel laureate in economics Daniel Kahneman presented the study in his bestseller, “Thinking, Fast and Slow,” thus making it accessible to a wider audience.Footnote 11 The process was apparently eased by the fact that Kahneman had already served as the study’s “prearranged editor” at PNAS.Footnote 12 Due to the spectacular and intuitive result, the study received significant media coverage.Footnote 13 Since then, the study has been widely referenced, with Google Scholar counting well over one thousand citations.Footnote 14
The study is cited in various contexts. In the legal literature, the study is typically used to confirm the basic assumption of legal realists, namely that factors beyond black-letter law influence a judge’s decision, and to motivate further research into these factors, with a special focus on heuristics and biases.Footnote 15 Thus, the study is used to emphasize that our knowledge of judicial decision-making is limited and that we should strive to learn more about it. However, the “hungry judge” effect is also used to argue for policy interventions. Specifically, the “hungry judge” effect is often evoked in the discourse on automated decision-aids, or AI, in law. As we will see, reliance on the “hungry judge” is problematic in general, but particularly so in the latter context.
The use of AI in law is a sensitive issue. Generally, the label AI is used as an umbrella term for technologies that, based on the acquisition and interpretation of data, can perform, or “mimic,” human decision-making tasks.Footnote 16 Such technologies have the potential to support legal practice. For example, AI can be employed for mass decision-making in administrative or tax proceedings.Footnote 17 The uses of AI, however, must be critically assessed from a human rights perspective.Footnote 18 Two prominent areas of application are particularly sensitive and, thus, are met with skepticism. The first area is algorithmic prediction of possible crime that allows for so-called predictive policing in order to guide the allocation of police resources.Footnote 19 The second area relates to risk-calculating tools that can be used to predict recidivism of defendants in criminal procedures.Footnote 20 The skepticism towards these applications is not just due to a general skepticism towards machine-based decisions.Footnote 21 The main worry about their use is that they may introduce new biases and risks to the decision-making procedure, particularly as they may disadvantage certain groups.Footnote 22
Much of the critical debate focuses on the use of software such as COMPAS, a software that provides judges with a risk-assessment for an individual defendant in court. Such software is regarded as problematic for at least three reasons. First, the assessment provided by the software’s algorithm may be biased. While it is possible to exclude obviously discriminating criteria—such as race—from the risk assessment, these criteria might still be correlated with other criteria that are very well included.Footnote 23 Second, it is difficult to control the inner workings of such software, as they are commercial and proprietary products.Footnote 24 Third, there are dangers connected to the interaction between the automated decision-aid and human judges who might place too much confidence in the recommendation they are provided with.Footnote 25 Warnings that are meant to counteract overreliance have been found to have only a very limited effect.Footnote 26
These problems are well-known and have a prominent place in the discourse on AI in law. This is where the “hungry judge” effect comes into play as a counterargument. In particular, the “hungry judge” effect is used to underline that human judgment is already susceptible to biases.Footnote 27 To give an example, the U.S.-based economist Jennifer Doleac, who works on issues of sentencing and on racial discrimination, counters the discussion about biased algorithms as follows: “[T]hat discussion misses an important point: Humans are biased. We routinely get things wrong due to an array of cognitive biases. Judges’ decisions are affected by factors such as whether they’re hungry (decisions are far less favorable just before lunch) and if their local football team won that weekend[.]”Footnote 28 So, the argument goes, even if there is a danger that algorithmic decision-making is biased, could it be more biased than judges who leave people incarcerated because they cannot control their hunger? AI systems could, at least, help judges detect, counteract, or exclude their own subconscious biases, thus increasing fairness and legal certainty.Footnote 29
Such references to “hungry judges” are not merely occasional; they are common. They can be found across the genres of writing on AI in law: in the scholarly literature,Footnote 30 in policy papers,Footnote 31 and in opinion pieces.Footnote 32 The references are so widespread that it feels safe to describe “hungry judges” as part of an “AI lore.” But as already indicated, referring to “hungry judges” is not just an innocent triviality. The “hungry judge” argument matters. The introduction of AI into legal decision-making comes with potential trade-offs. This requires political debate and a careful cost-benefit analysis. The “hungry judge” argument paints a very negative picture of the current reality in the courtroom. Compared to such an arbitrary practice as letting a judicial decision depend on its proximity to a break, every intervention that makes a decision more rational appears legitimate, even urgent. Thus, the argument increases the appeal of automated decision-aids to an audience that might be critical at first.
The broad reliance on the “hungry judge” effect and the sensitive context of far-reaching policy proposals warrant a critical engagement with the evidence. As we will see, the confidence that the discourse places in the original study is not well founded.Footnote 33 So, while human bias might indeed be a valid argument for the use of automated decision-aids, the particular reference to the “hungry judge” effect is not.
C. Is There a “Hungry Judge” Effect?
The following discussion starts by revisiting the original study in detailFootnote 34 and then moves to the arguments against its validity. The objections focus on the magnitude of the found effect,Footnote 35 the factual basis of the study,Footnote 36 and alternative explanations for the found effect.Footnote 37
I. Danziger et al.: “Yes”
In their 2011 study, Danziger et al. analyze 1,112 decisions by eight judges on fifty days over the course of ten months.Footnote 38 The judges decided on requests from prison inmates to have their prison sentences suspended. During the examined period, between fourteen and thirty-five cases were processed every day, with an average processing time of just under six minutes per case. The authors’ data include the order of the cases and the time of the hearing. Each day, sessions were structured around two breaks: a (late) breakfast break and a lunch break. According to the authors, the judges had no knowledge of the next case and no influence on the order of the cases. On this basis, Danziger et al. test how the position of a case in the order of all cases and the breaks affect the decision-making behavior of the judges. All case decisions were coded as either favorable or unfavorable. The hypothesis of the study is that, due to fatigue, repeated decisions would lead to more decisions maintaining the status quo, especially because favorable decisions were more labor-intensive due to the greater effort involved in giving a justification for an early release.Footnote 39
The analysis shows that the number of unfavorable decisions strongly increased before the breaks. The highest probability of a favorable decision is around sixty-five percent at the beginning of a session and then falls to almost zero—after a break, it is then again around sixty-five percent.Footnote 40 Using a statistical control technique,Footnote 41 the authors examine to what extent other factors—such as certain characteristics of the case—could explain these results. Of the “legal” variables that can be found in the case file, previous convictions and the lack of a rehabilitation program significantly reduce the likelihood of a favorable decision. Importantly, the order in which cases were decided was found to be decisive.
Crucially, the internal validity of the study (that is, whether the research design justifies the inference) depends on whether the sequence of cases was actually random, that is, whether it can be safely assumed that cases that appeared at the end of a session did not differ in any particular characteristic from the earlier cases.Footnote 42 Only then does the conclusion hold that it is, indeed, the order that explains a certain decision. This would not be the case if an unobserved variable affected the order of the cases. In contrast to an experiment, in an observational study the order in which cases are decided and the information available to the decision-makers cannot be determined by the researcher; thus, the influence of one or more unobserved variables cannot be excluded.Footnote 43
Every observational study that aims at establishing causality tries to exclude, or control for, unobserved influencing factors. Here, the assumption that the judges had no knowledge of the next case is of crucial significance. The same applies to the statement that the time at which each case is dealt with depends only on the arrival time of an inmate’s lawyer.Footnote 44 Based on these assumptions, the order of the cases would be exogenous from the judges’ perspective. Thus, Danziger et al. conclude that the only explanation for the high rejection rate shortly before the breaks is the judges’ fatigue or, more bluntly, their hunger.
II. Effect Size and Implausibility
There are several objections to the validity of the “hungry judge” study. The most general one is related to the size of the effect found by Danziger et al. The effect is particularly large, which calls for caution.Footnote 45 Generally, psychological effects reported for the first time tend to be larger than in later replications.Footnote 46 But the effect size also appears very large specifically when compared to the effects measured in controlled laboratory studies on fatigue symptoms.Footnote 47 For this reason, critics like psychologist Daniel Lakens have even dismissed the study outright. On his blog, Lakens describes the effect as implausible and even impossible. It is simply too big to be caused by a psychological mechanism.Footnote 48 To quote Lakens: “If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion.”Footnote 49 The effect appears so big that one would actually not need a scientific study to identify it. For Lakens, the lack of a plausible theory that explains an effect of this magnitude already suffices to dismiss the study entirely. Another good reason for this objection is that we do not observe any kink in the data immediately after a lunch break, where our attention typically fades and we would assume another point of possible mental depletion.Footnote 50 Apart from this “implausibility critique,” there are further objections. The next one attacks the factual basis of the study.
III. Factual Basis: Non-random Ordering and Legal Representation
Danziger et al. rely crucially on the assumption that the order of the cases is random and, thus, exogenous to the decision-making process. This assumption has been forcefully challenged. In a short and very critical reply in PNAS, Keren Weinshall-Margel and John Shapard analyzed the data of the original study, as well as other self-collected data, and conducted additional interviews with the court personnel involved.Footnote 51 They point out that the order of the cases is not random: The panel tries to deal with all cases from one prison before a break and then moves to the cases of the next prison after the break. Most importantly, though, requests from prisoners who are not represented by a lawyer are typically dealt with at the end of each session. Since prisoners without legal representation are less likely to receive a favorable decision than those with legal representation,Footnote 52 this ordering alone lowers the share of favorable decisions toward the end of a session. Additionally, lawyers often represent several inmates and decide on the order in which the cases are presented; it might well be that they start with the strongest cases.Footnote 53
Danziger et al. reacted to this criticism in a short reply, seeing no reason to revise their conclusions. But the reply does not appear conclusive. To begin with, they state that ordering did not depend on the prison of origin. However, to back that claim, they could only provide data on the prison of origin for five of the fifty days they used for their first analysis. Further, they rule out an ordering by prison because the order of the cases depended on the arrival of the lawyers.Footnote 54 In addition, they state that the cases in which the same lawyer represented several inmates were too rare to affect the analysis. Most importantly, though, Danziger et al. report that they have now included legal representation as a control variable in their regression analysis and that they see their results confirmed.Footnote 55 However, they do not report whether this reduces the size of the effect. Thus, their answer is vague. Crucial questions remain unanswered: How strong is the effect of ordering if one accounts for legal representation? And, even more importantly, can it be said with certainty that the results are not influenced by an unobserved variable?
IV. Alternative Explanation: Time Management
As the size of the effect appears implausible, and the factual basis for assuming a random case order is uncertain, could there be alternative explanations for the observed effect? Apart from the already mentioned legal representation, a further unobserved variable that could influence the case order and explain the results would be the workload management by the judges.Footnote 56 Indeed, there are some clues that time management could play a role. According to Danziger et al., rendering favorable decisions takes more time than rendering unfavorable ones. Thus, judges might want to avoid starting more complicated cases before the break, as they would run the risk of exceeding their time limit and taking their break later or not at all.Footnote 57 To assess whether the next case is a time-consuming one, superficial hints can be enough: think of the thickness of the file, a post-it from a clerk, or the wording of the applicant’s request. Such a planning decision could reduce the likelihood that positive cases are among the cases with a higher, or later, position within a session. Could this be a possible explanation for the effect that we see before the breaks?
Social psychologist Andreas Glöckner came up with an elegant way to test this alternative explanation. Glöckner uses a computer simulation. This method is complementary to an observational approach: Instead of observing and analyzing existing data, a simulation generates data. Thus, it answers the question what data would look like under certain specified conditions. The results can then be contrasted. In his argument, Glöckner makes the assumption that the judges, or their staff, have external clues as to whether a case will take more or less time. This assumption is plausible and has grave effects.
To get an idea of how strong the effect of a planning decision is, Glöckner simulates the decisions of a judge who is not subject to errors and distortions due to hunger or fatigue. How would such a judge handle her workload under time constraints? Using a statistical program,Footnote 58 Glöckner generates 10,000 randomly ordered cases, of which thirty-six percent are positive and sixty-four percent negative, thus mirroring the proportions in the original study. Then, he generates response times that follow the averages reported by Danziger et al. These averages vary for positive and negative rulings, with the positive rulings taking significantly more time.Footnote 59 To mirror that a judge does not have unlimited time at her disposal, Glöckner sets a time limit for decisions at sixty minutes. In the simulation, the judge handles cases until the next case would go beyond the time limit. This models the situation in which a judge would recognize from an external clue that the next case cannot be dealt with without going over a time limit. The data that the simulation generates can then be contrasted with the observations from the original study.
It turns out that even in the case of a judge to whom cases are presented in random order, the probability of positive decisions decreases sharply towards the end of the session. The closer to the end of the session, the fewer positive cases appear.Footnote 60 The explanation is straightforward: Because negative cases can be dealt with in a shorter timeframe, the proportion of cases that can still be dealt with in the remaining time before the break is greater among the negative cases. This reduces the likelihood that a session will end with a favorable decision—leading to a bias in the sample. At the same time, this also explains why the probability of a favorable decision is higher at the beginning of a session right after a break. It is more likely that a session will end on an unfavorable decision, with a favorable decision being postponed to the beginning of the next session.Footnote 61 Thus, a real judge would generate a similar pattern, if only she operates under a time limit and has a rough clue about whether she will have to write a short or long opinion. This holds even if her judgment is not distorted by hunger or fatigue.
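The planning scenario can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Glöckner’s original code: the number of cases (10,000), the share of positive cases (thirty-six percent), and the sixty-minute limit follow the description above, whereas the mean durations of seven and five minutes, the standard deviation of two minutes, and all function names are hypothetical placeholders and not the averages reported in the original study.

```python
import random
from collections import defaultdict


def simulate_sessions(n_cases=10_000, p_favorable=0.36, time_limit=60.0,
                      mean_favorable=7.0, mean_unfavorable=5.0, sd=2.0, seed=1):
    """Return {position in session: (number of cases, number of favorable cases)}.

    The durations (mean_favorable, mean_unfavorable, sd) are illustrative
    assumptions, not values taken from the original study.
    """
    rng = random.Random(seed)
    counts = defaultdict(lambda: [0, 0])
    elapsed, position = 0.0, 0
    for _ in range(n_cases):
        # The outcome is fixed before the case is scheduled: no hunger, no fatigue.
        favorable = rng.random() < p_favorable
        mean = mean_favorable if favorable else mean_unfavorable
        duration = max(0.5, rng.gauss(mean, sd))
        # Planning decision: if the case would overrun the limit, the judge takes
        # the break and hears this case first in the next session.
        if position > 0 and elapsed + duration > time_limit:
            elapsed, position = 0.0, 0
        position += 1
        elapsed += duration
        counts[position][0] += 1
        counts[position][1] += int(favorable)
    return {pos: tuple(c) for pos, c in sorted(counts.items())}


def favorable_share(counts, positions):
    """Pooled share of favorable cases over the given session positions."""
    n = sum(counts[p][0] for p in positions if p in counts)
    k = sum(counts[p][1] for p in positions if p in counts)
    return k / n if n else float("nan")


counts = simulate_sessions()
early = favorable_share(counts, [1])
late = favorable_share(counts, [p for p in counts if p >= 11])
print(f"share favorable at position 1:      {early:.2f}")
print(f"share favorable at positions >= 11: {late:.2f}")
```

Under these assumptions, the share of favorable decisions should be higher at the first position of a session than at late positions, although no simulated decision depends on the judge’s condition. Late positions are only reached in sessions filled with short, and therefore mostly negative, cases, while the postponed cases that open a session are disproportionately long and positive.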
Yet, Glöckner identifies another problem. Since the sessions in the original study were of different lengths, Danziger et al. compared only the first ninety-five percent of the cases. Could this “censoring” of data also have biased the sample? Glöckner simulates a judge who has no information about the complexity of the case but has to stop when a time limit is reached. As positive cases take longer to process, the likelihood increases that it is a positive case on which time will run out. Thus, the decision by Danziger et al. to disregard the cases on which the sessions end affects positive cases more than negative ones. This increases the relative frequency of negative cases in the sample. This censoring of the last five percent of the observations also intensifies the downward trend of favorable decisions in the previous scenario of a judge who makes planning decisions.Footnote 62
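The censoring argument can be illustrated in the same way. The following sketch is a simplification under stated assumptions and not a reconstruction of Glöckner’s analysis: the judge has no planning information and simply stops once the sixty-minute limit has passed, and the sketch compares the session-ending case with all earlier cases instead of dropping the last five percent of positions. The durations and the function name are again hypothetical placeholders.

```python
import random


def last_case_bias(n_cases=10_000, p_favorable=0.36, time_limit=60.0,
                   mean_favorable=7.0, mean_unfavorable=5.0, sd=2.0, seed=1):
    """Return (favorable share among session-ending cases, share among all others).

    The durations are illustrative assumptions, not values from the original study.
    """
    rng = random.Random(seed)
    last = [0, 0]   # [cases, favorable] for the case on which time runs out
    other = [0, 0]  # [cases, favorable] for every earlier case of a session
    elapsed = 0.0
    for _ in range(n_cases):
        favorable = rng.random() < p_favorable
        mean = mean_favorable if favorable else mean_unfavorable
        elapsed += max(0.5, rng.gauss(mean, sd))
        # No planning: every case is started; the session ends with the case
        # during which the time limit is reached.
        session_ends = elapsed >= time_limit
        bucket = last if session_ends else other
        bucket[0] += 1
        bucket[1] += int(favorable)
        if session_ends:
            elapsed = 0.0
    return last[1] / last[0], other[1] / other[0]


last_share, other_share = last_case_bias()
print(f"share favorable among session-ending cases: {last_share:.2f}")
print(f"share favorable among all other cases:      {other_share:.2f}")
```

Because longer cases are more likely to be the ones during which time runs out, the session-ending cases should contain a higher share of favorable decisions than the rest. Removing them from the sample therefore lowers the observed share of favorable decisions, which is the direction of the bias described above.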
The simulations show that an explanation other than hunger or fatigue is indeed likely. If we accept the plausible assumption that at least some planning occurs, we can also explain the results on the basis of a judge who is not subject to any distortion.Footnote 63 This could then also be described as a case of reverse causality. It is not the position of the case in the order of all cases that turns the case into a positive or negative one. Rather, it is its characteristic as positive or negative that makes a case more likely to appear at a certain position in the order of all cases. This explanation is not sensational but very plausible. So, given the doubts about the magnitude of the effect, the weak factual basis for the assumption of a random order, and the plausibility of time management as an alternative explanation, it is safe to say that the results of the study are not credible. In sum, based on the discussed evidence, there is no “hungry judge” effect.
D. The Delicate Reception of Social Science Evidence in Law
Notwithstanding the strong doubts about the study’s validity, the “hungry judge” study is an uncomfortable case for legal scholarship for several reasons. First and foremost, it is a study that—from the outside—appears to be carried out thoughtfully and rigorously. Then, while the effect size is implausible, effect sizes are not something legal scholars are trained to discuss. Importantly, most legal scholars must rely on other secondary clues. The study was published in a highly reputed outlet and endorsed by the most famous authority on cognitive bias research.Footnote 64 Thus, it carries the secondary signals that legal scholars typically—and with good reason—rely upon when appraising the output of a research field that is not theirs.Footnote 65 Ultimately, what makes the study especially hard to deal with is that it confirms an intuition, or popular narrative, which is already shared. Possibly, our wish to see our intuition confirmed is even stronger than our lawyerly caution. I see three main takeaways from this case that relate, first, to research on judicial bias;Footnote 66 second, to the appraisal of social science from a legal perspectiveFootnote 67 ; and, third, to the problem of selective presentation and reception of social science evidence in the AI discourse and beyond.Footnote 68
I. Research on Judicial Bias
The first main takeaway relates to the research on the influence of non-legal factors on judicial decision-making. The state of this research should be treated with caution. On the one hand, there is a solid body of research on the influence of certain factors like, for example, ideology or demographics on the decision-making of apex courts, with a particular focus on the Supreme Court of the United States.Footnote 69 This body of research can be described as well-developed and robust. The Supreme Court has, for a long time, been at the center of scholarly attention in political science, and many empirical studies provide evidence to support theories on judicial behavior.Footnote 70 On the other hand, there is research that looks into the influence of variables on judicial decision-making that is not as strongly grounded in theory, but rather mirrors intuitions.Footnote 71 Furthermore, some of this research appears to be primarily after sensational results. Apart from the effect of hunger on parole decisions, this research examines the effects of sleepinessFootnote 72 or sports resultsFootnote 73 on sentencing, or of weather on asylum decisions.Footnote 74 While this research is valuable, it cannot be considered as robust as research on ideology because it wrestles with methodological problems. Scholars like Holger Spamann have regularly dismantled such approaches.Footnote 75 Spamann himself has taken a different route in his research. In a study co-authored with Lars Klöhn, he examines the effects of legally relevant and irrelevant information about a defendant on the assessment of legal questions in an experiment with actual judges, thus having full control of the information that the judges receive.Footnote 76 This line of research, which has recently been substantially extended,Footnote 77 not only covers potential non-legal influences, but also the influence of intra-legal factors on the techniques of judicial decision-making.
All this is certainly not meant to say that judicial decision-making is not susceptible to bias or error,Footnote 78 but sensational studies that sound too good to be true are better met with a healthy dose of skepticism. Thus, should such a sensational study matter for a policy argument, we would be well served to check how the study has been received within the relevant community. While much of this extra-legal disciplinary knowledge might be implicit, recent years have made debates in other disciplines much more accessible. Podcasts and blogs have played their part in this development, as the aforementioned example of Daniel Lakens’ work illustrates.
II. Appraising Social Science From a Legal Perspective
Beyond the specific case of research on bias in judicial decision-making and with regard to social science studies more generally, a delicate balance must be struck. In particular, two extremes should be avoided: while a single study should not be overestimated as reliable evidence, neither should its useful potential to alert us be ignored. Still, the latter potential must be put into perspective.
The problems of the “hungry judge” effect already show why we would be ill-advised to justify a legal reform with a single study.Footnote 79 It cannot be ruled out that later studies refute primary findings or, at least, put their informative value into perspective. This leads to the next question: when does social science evidence suffice for justifying a legal intervention? This is a demanding normative question, especially as legal scholars are typically not trained in social science methods.Footnote 80 A plausible criterion or starting point would be whether we can regard scientific knowledge as consolidated. Such consolidation generally takes place through replication,Footnote 81 meta-analysis of several studies,Footnote 82 or even through the growth of a field that leads to a denser discursive control through peers. Whether the consolidation meets the standards required for justifying a particular legal intervention remains a question legal scholars have to answer.Footnote 83
However, one could say that studies like the one discussed have the potential to challenge us. This potential might indeed be valuable. It makes us question whether we are as immune to bias or as objective as we would like, or as the law requires us, to be.Footnote 84 Indeed, the study by Danziger et al. might be used solely as a rhetorical device, without founding a policy argument on it.Footnote 85 Still, we should acknowledge that even such rhetorical use is not harmless. Even if a certain irritating effect might be welcome, this does not justify spreading studies which are problematic from a methodological standpoint and which “paint the justice system worse than it actually is.”Footnote 86 The public perception of the justice system matters. Exaggerations can undermine trust in the legal system. Such trust, however, is a precious societal resource, as the judiciary critically depends on public support to maintain its independence. As readers of this journal are aware, judicial independence is currently under pressure in many countries.Footnote 87 Studies like the one discussed are easily exploited politically. In fact, German far-right populists have already used the “hungry judge” effect to demand less funding (a “diet”) for judges and more severe sentences for criminals.Footnote 88
III. The Problem of Selective Reception of Social Science Evidence
In court, the establishment of the facts is a highly contested matter. Each side presents their version of the relevant events to make a favorable judgment more likely. Similarly, in the context of legal or policy choices, the danger of a selective reception of social science evidence looms large.
The reasons for selective reception may be several. Regarding the “hungry judge” effect, our preference for narratives that confirm what we already thought is on full display. Psychologists might speak of “confirmation bias” or even “motivated reasoning.”Footnote 89 The problem may not appear serious at first sight, but we should be aware that our responsiveness to narratives may be exploited. As in court, the underlying strategy is a selective presentation of the facts that are supposed to constitute a problem: “problem gerrymandering,” so to speak.Footnote 90 A study like the one discussed can be employed to frame a problem in a certain way, thus making a particular, favored intervention more likely.Footnote 91 Critically, problem construction and proposed solutions are typically entangled.Footnote 92 As the “hungry judge” study has become part of the debate on “AI and law,” there is a danger that we will come to wrong conclusions on the necessity of interventions, or that we will possibly choose the wrong ones. Imagine that, based on the “hungry judge” study, Israeli legislators had decided to entrust parole decisions to an algorithm. This would have created new risksFootnote 93 without even fixing an existing problem.
E. Conclusion
The “hungry judge” effect is famous, but, as we have seen, there are strong doubts about its validity. Therefore, we should not rely on it to motivate the introduction of automated decision-aids or other policy interventions.Footnote 94 The case presented here should also make us vigilant regarding our use of social science evidence more generally. Especially when an empirical study fits our preferred narrative but at the same time sounds too good to be true, caution is advisable. Similarly, at the end of this article, we should not be too sweeping with our conclusions. The fact that the discussed piece of evidence may be flawed certainly does not mean that judges cannot be biased or that computer-based decision-aids cannot be put to socially beneficial use. But when it comes to human flaws and biases in judging, more collaborative research is needed.Footnote 95 This, in turn, would lead to a better assessment of where AI can actually be of help.