The use of algorithms in the public sphere is exploding. Algorithms have been applied in criminal justice, Footnote 1 voting, Footnote 2 redistricting, Footnote 3 policing, Footnote 4 allocation of public services, Footnote 5 immigration, Footnote 6 military and intelligence decision-making, Footnote 7 and a range of other sensitive fields. Because government agencies typically lack the resources to build their own systems or to pay the premium associated with hiring high-level software engineers, most of these algorithms are developed in the private sector. Footnote 8 For example, the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software, which has been used for assessing the risk of defendants in bail hearings and in criminal sentencing in at least four states and was the source of heated controversy about racial bias and a case before the Wisconsin Supreme Court, Footnote 9 was developed by Northpointe (now Equivant). Similarly, the PredPol system for identifying areas for additional police presence was developed by PredPol, Inc. (now Geolitica) and has been used by at least 20 U.S. law enforcement agencies to help inform their policing practices. This system has also been highly controversial due to concerns that it disproportionately identifies communities of color for additional police presence. Footnote 10
Sociological studies suggest that decision-makers give a high level of credence to algorithms when making decisions, in spite of concerns over bias, accuracy, and myopic optimization. Footnote 11 Sentencing algorithms, for example, have received broad bipartisan support in Congress for their expanded use. Footnote 12 There is also increasing evidence that the public trusts algorithms, at least in terms of behavioral responses to advice, as much as or more than other sources of advice in a range of scenarios, including in public policy, Footnote 13 although some scholars still note significant aversion to algorithms in some contexts. Footnote 14 The seeming increase in public trust in algorithms runs counter to recent studies on trust in human experts, which suggest a growing negative sentiment and general anti-intellectual attitudes among the public, Footnote 15 even as the public finds experts more persuasive than non-expert information sources. Footnote 16
In response, ethicists and legal scholars have raised two interrelated concerns. First, public decision-makers might rely too much on algorithms because of their perceived “objectivity” and “efficiency.” Footnote 17 These tendencies have been labeled “technogoggles,” Footnote 18 “technowashing,” Footnote 19 or “math washing” Footnote 20 by critics, who see these tools either as a way for public decision-makers to base their decisions on “more objective” criteria or as a way to avoid responsibility for difficult, discretionary decisions. Scholars worry that the use of algorithms adds a sense of legitimacy to otherwise contested decisions Footnote 21 and allows for scapegoating the algorithm for mistakes. Footnote 22 These claims are quite prominent, appearing in several multi-award-winning and bestselling books Footnote 23 and throughout a collection of essays by top scholars in the field of artificial intelligence (AI) ethics. Footnote 24
Second, even if a policymaker has doubts about an algorithm’s information, they may feel pressured to comply with the algorithm’s recommendations. Footnote 25 An elected official or bureaucrat who accepts an algorithm’s judgment could pass the blame along to a flawed algorithm if there is an adverse outcome, while one who rejects the algorithm’s judgment would have to explain why they rejected the (correct) information given to them. Footnote 26 Risk-averse political actors, these scholars fear, will face strong incentives to maintain the political cover algorithms provide. Surden’s scenario is worth quoting at length, since, even though we were not aware of it at the time we designed our study, it mirrors our setup in many ways Footnote 27 :
“[J]udges have incentives not to override automated recommendations. Imagine that a Judge was to release a defendant despite a high automated risk score, and that defendant were then to go on to commit a crime on release. The judge could be subject to backlash and criticism, given that there is now a seemingly precise prediction score in the record that the judge chose to override. The safer route for the judge is to simply adopt the automated recommendation, as she can always point to the numerical risk score as a justification for her decision.”
He goes on to note that this is ethically problematic for at least three reasons: (1) the numeric scores pose a “problem of false precision,” wherein the numeric scores are divorced from practical meaning; (2) the use of the scores produces a “subtle shifting of accountability for the decision away from the judge and toward the system”; and (3) the use of private, proprietary algorithms produces a “shift of accountability from the public sector to the private sector.”
Albright provides some evidence that this process is playing out in actual bail decisions. Footnote 28 Looking at bail decisions in Kentucky, she finds that a lenient recommendation by the sentencing algorithm increases the likelihood of a lenient bail decision by about 50% for marginal cases. She posits that this is because judges believe that, if they make a retrospectively incorrect decision, at least part of the blame will fall on the recommendation instead of themselves. There have also been reports of the inverse mechanism. For example, in a story by The New Yorker about AI systems for evaluating the mental health of students in schools, a school therapist suggested that he would be unlikely to go against an AI evaluation that a student was potentially suicidal because of potential liability. Footnote 29
These concerns are compounded by the development of most of these algorithms within the private sector. Intellectual property protections for the algorithms mean that the public is often unaware of what data is being used or how that data is being modeled to produce the resulting predictions. Footnote 30 While advocates for the use of algorithms note that this is not much different from the inscrutability of the human motivations that may underlie particular decisions, Footnote 31 a growing chorus of concerns has been raised that algorithms allow biases to scale, Footnote 32 create negative feedback effects, Footnote 33 and increase the discretion of agencies to pursue otherwise controversial or biased practices. Footnote 34
Yet, there are reasons to doubt whether the blame dynamics suggested in this literature will be manifest in popular opinion. While AI experts prefer to emphasize the unique mathematical and technical aspects of AI, Footnote 35 the public tends to anthropomorphize AI, emphasizing the characteristics of AI that reflect human characteristics and expertise. Footnote 36 Indeed, some studies even suggest that the public attributes intentionality to algorithms' actions. Footnote 37 This anthropomorphization may undermine the hypothesized distinction between AI and other forms of expert advice among the public. It may also undermine the blame dynamics hypothesized in the theoretical literature. Bertsou finds that the public supports the role of experts primarily in how decisions are implemented, not in what decisions are made. Footnote 38 This is consistent with Agrawal, Gans, and Goldfarb's distinction between prediction and judgment, in which AI systems improve prediction to help humans make better judgments. Footnote 39 Even in studies that find higher weight given to advice from algorithms than from human experts, a large majority of respondents in all conditions give their own evaluation the highest weight, Footnote 40 suggesting hesitancy among the public to delegate even predictive tasks to an outside source, whether expert or algorithm. Thus, the blame dynamics posited by the above literature may not be manifested among the public if algorithms are viewed similarly to other expert advice, trust in which has been declining, or if the public official is viewed as abdicating their obligation to use their own judgment.
This study focuses on evaluating the concerns of ethicists and legal scholars with regard to both the dislocation of blame (when a decision-maker makes a mistake in concurrence with an algorithm) and the magnification of blame (when a decision-maker makes a mistake in disagreement with an algorithm). We conducted a pre-registered experiment on two representative samples of the U.S. population to directly test these concerns in the judicial decision-making context (Study 1), a context notable for both its salience in the literature on algorithms in public policy and its centrality as an example in much of the literature laid out above. Footnote 41 It is also an area in which the private sector has been very active in developing tools for decision-makers.
Contrary to what legal scholars and ethicists have assumed, there is no significant decrease in blame for mistakes when a decision-maker agrees with an algorithm (if anything, the amount of blame appears to increase slightly), though we are careful not to make too much of these substantively small effect sizes, and they do not show up in our third sample (Study 2, discussed below). While we do find some increase in blame associated with disagreeing with an algorithm, this increase is not distinguishable from agreeing with an algorithm or from disagreeing with another source of advice. Again, these differences were small and did not show up in our third sample. Moreover, these experiments provide no evidence that the results are a product of demographics or of general algorithm aversion. Footnote 42
To explain this counter-intuitive result, we conducted a second pre-registered experiment (Study 2) on a new representative sample. In this sample we found no significant difference in blame between when the decision-maker was advised by an algorithm and when they decided on their own, a result that is not terribly surprising given the relatively small effect sizes in Study 1. However, while there are no significant differences in blame between scenarios on average, two specific factors seem to be particularly important when it comes to agreement or disagreement with an algorithm's advice. We found some evidence that respondents who are more trusting of experts generally place more blame on the decision-maker when they disagree with an algorithm's advice, while those less trusting of experts place less blame on the judge when they disagree with the advice. This helps explain why we do not find a larger magnifying effect on blame of disagreeing with the algorithm's advice: most of the sample was below the threshold of expert trust above which we see an increase in blame, suggesting that most of the sample receiving this treatment felt that rejection of the advice was justifiable and, perhaps, even appropriate. There was also some evidence that agreeing with the algorithm's advice results in respondents viewing the decision-maker as abdicating their duty to use their own judgment. In other words, the perception that the judge was using the algorithm's judgment in place of their own increases, rather than deflects, the blame placed on the judge.
In sum, we find no evidence to support the concerns of ethicists and legal scholars, at least in the judicial context. These findings are consistent with more general findings on the role of expertise in the policy process Footnote 43 and suggest that the use of algorithms is not (yet?) a special case when it comes to expert advice.
Study 1
Building on the scenarios laid out in the legal, ethics, and political science literatures, this study looks at what happens when a judge uses an algorithm in making a decision about whether to jail or release a defendant, and compares this with situations in which the judge uses their own judgment, advice from a human source, or the combined advice of an algorithm and a human source. Footnote 44
Treatments
We asked respondents to read a brief scenario, very similar to that in Surden and developed in consultation with a law enforcement professional with more than 20 years of experience evaluating defendants for judges in three states. Footnote 45 The scenario involved a judge making a sentencing decision as to whether to grant probation to a defendant in a repeat drunk driving case. In all scenarios, the judge decides to release the defendant on probation and the defendant is subsequently involved in another drunk driving accident that kills a pedestrian.
Each respondent was randomly assigned to one of four conditions: 1. judge decides with no additional input (control condition), 2. judge decides with the assistance of an algorithm, 3. judge decides with the assistance of a probation officer, or 4. judge decides with the assistance of a probation officer and an algorithm. For the advice (non-control) conditions, respondents were also randomized between whether the advice was for probation and the judge agreed, or the advice was for imprisonment and the judge disagreed. The full set of treatments is outlined in Table 1. Respondents were then asked how much blame they placed on the actors involved in the scenario for the adverse outcome. Footnote 46 Our main variable of interest is the degree of blame placed on the judge in each advice condition. This measure of blame ranges from 1 (“none at all”) to 10 (“a great deal”). Footnote 47 It is re-scaled to range from 0 to 1, so effects can be interpreted as the proportion increase in the scale.
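For concreteness, assuming the rescaling is a linear min-max transformation (our reading of the description above, not a formula stated by the authors), a raw blame rating $b \in \{1, \dots, 10\}$ maps to

$$\tilde b = \frac{b - 1}{10 - 1},$$

so a one-point change on the raw scale corresponds to a change of roughly 0.11 on the rescaled measure, and, under this reading, an effect of 0.09 on the rescaled measure corresponds to roughly 0.8 points on the raw 1-10 scale.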
Although there was little clear empirical guidance from previous literature about what we expect in this particular experiment, we pre-registered the following hypotheses Footnote 48 :
“Hypothesis 1: When an error occurs, a policymaker’s (judge) reliance on advice from an algorithm will reduce the level of blame compared to relying on his/her judgment alone. Conversely, disregarding the algorithm’s advice will increase the level of blame.
Hypothesis 2: The reduction in blame from relying on an algorithm will be similar to that of reliance on advice from a trained bureaucrat.
Hypothesis 3: When an error occurs, a policymaker’s reliance on advice from a hybrid system involving both an algorithm and a trained bureaucrat will reduce the level of blame more than relying on either source alone.”
Hypothesis 1 is drawn directly from the concerns of ethicists and legal scholars laid out above. Hypothesis 2 draws from Kennedy, Waggoner, and Ward, who find that individuals trust advice from automated systems as much as or more than advice from other sources. Footnote 49 Hypothesis 3 is also drawn from Kennedy, Waggoner, and Ward, Footnote 50 who find that hybrid systems, involving the judgment of both an expert and a computer, are trusted more than either source alone.
Sample
The analysis was based on two demographically representative samples of the U.S. population. The first survey sample was collected in June 2021 using Lucid's Theorem platform. Of the 1,500 respondents who participated, 923 (62%) passed the attention checks and were utilized in the study. Footnote 51 Lucid draws from a range of survey panels and automatically assigns participants to match U.S. census demographics; it has been shown to replicate a range of well-established experimental results Footnote 52 and is utilized by many of the most prominent companies in the survey industry. Footnote 53 These data were originally collected just prior to the pre-registration, but they were not inspected or analyzed until after the registration. Footnote 54 The second sample of 1,842 respondents was collected as part of a Time-sharing Experiments for the Social Sciences survey, which contracts with the National Opinion Research Center at the University of Chicago through its AmeriSpeak panel. This program has been used for a range of influential social science survey experiments. Footnote 55 Data from this sample were not received until four months after the pre-registration. The samples are pooled for analysis, and no significant differences in results were noted between them.
Analysis
The responses were analyzed using OLS regression of the form:

$${\rm Blame}_j = \alpha + \sum\nolimits_i {\beta _i}{T_{ij}} + {\epsilon _j},$$

where $\alpha $ is the overall intercept, ${\beta _i}$ is the estimated slope coefficient (effect) of each of the i treatments, with the control condition omitted as a baseline, ${T_{ij}}$ is an indicator for respondent j receiving treatment i, and ${\epsilon _j}$ is the error term. Footnote 56 Confidence intervals were calculated from 1,000 simulated draws from the distribution of the coefficient estimates. Footnote 57
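As a minimal illustration of this estimation strategy (a sketch under our assumptions, not the authors' code), the model and the simulation-based confidence intervals could be computed as follows; the data frame, column names, and condition labels are hypothetical placeholders.

```python
# Sketch of the OLS treatment-effect model with simulation-based confidence
# intervals. The placeholder data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "blame": rng.uniform(0, 1, 400),          # rescaled 0-1 blame score
    "condition": rng.choice(["control", "algo_agree", "algo_disagree",
                             "human_agree", "human_disagree"], 400),
})

# Treatment dummies with the control condition omitted as the baseline.
model = smf.ols("blame ~ C(condition, Treatment(reference='control'))", data=df).fit()

# 1,000 simulated draws from the sampling distribution of the coefficients,
# with percentile-based 95% intervals, as described in the text.
draws = rng.multivariate_normal(np.asarray(model.params),
                                np.asarray(model.cov_params()), size=1000)
ci = np.percentile(draws, [2.5, 97.5], axis=0)
print(pd.DataFrame(ci.T, index=model.params.index, columns=["2.5%", "97.5%"]))
```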
Results
The results suggest that, irrespective of whether the judge agrees or disagrees with the algorithm, the degree to which the public says the judge is to blame for the adverse outcome increases slightly compared to when the judge makes the decision without assistance. When the algorithm recommends probation and the judge agrees, respondents, on average, place 9% more blame on the judge in light of the tragic outcome, compared to when the judge decides without assistance. When the algorithm recommends jail and the judge disagrees, respondents place about 6% more blame on the judge (Figure 1b). Figure 1a also shows that there is no tradeoff in blame; that is, respondents did blame the algorithm for the mistake under the agreement condition, but this did not reduce the culpability assigned to the judge.
We note that these effects are small and, therefore, should be interpreted with some caution. Footnote 58 Cohen's d for the results ranged from 0.16 to 0.28 (see SI.8), which is in the negligible to medium-small range. While technically statistically significant in this study, such small effects are unlikely to have a strong impact on overall evaluations of the judge, and, as we note in SI.12, there was no impact found on the likelihood of voting for the judge in an election. Moreover, such small effects are less likely to replicate regularly, and, as we detail below, the significance level, though not the direction of the relationship, changes in Study 2. Interpreting these as null effects, however, still runs counter to popular expectation, and suggests that agreement with an algorithm's advice will not buffer decision-makers from blame, nor will decision-makers necessarily receive more blame when they disagree with an algorithm. Footnote 59
We also conduct hypothesis tests for the significance of differences between the treatment arms. Footnote 60 Table A9 in the SI shows that there are no significant differences in the blame placed on the judge based on the source of advice. Contrary to what is posited by the theoretical literature above, whether the advice comes from an algorithm, a human, or a combination of the two, the differences are not statistically significant (p > 0.1). Thus, we fail to reject the null hypothesis that there is no differential impact based on the source of advice.
Why the concerns of ethicists and legal scholars are not borne out empirically is difficult to discern from this study. We first note that this does not appear to be a simple example of algorithm aversion, Footnote 61 as we observe similar increases in blame under the human and combined conditions in Figure 1b.
Figure 2 tests for effect moderation based on demographic characteristics Footnote 62 and generalized trust in algorithms. Footnote 63 Analysis for moderation is conducted using parallel within-treatment regression analysis to estimate the average treatment moderation effect (ATME), Footnote 64 since, unlike traditional treatment-by-covariate interactions, these estimates have a causal interpretation. Footnote 65 The process has the form, for each level t of the treatment,

$${Y_i} = {\alpha _t} + {\gamma _t}{S_i} + {X_i}'{\beta _t} + {\epsilon _{it}},\qquad {\hat \delta _{PR}} = {\hat \gamma _t} - {\hat \gamma _0},$$

where ${S_{i}}$ is the potential moderator variable, ${X_i}'$ is the set of other variables, ${T_i}$ is the level of the treatment (in this case treated as present or absent, though it extends intuitively to our multi-treatment context), ${\gamma _t}$ is the OLS coefficient for the potential moderator estimated within treatment level t, and ${\delta _{PR}}$ is the ATME.
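The sketch below illustrates this parallel within-treatment logic under our assumptions; the simulated data, variable names, and the simple standard-error combination are hypothetical placeholders rather than the authors' implementation.

```python
# Sketch of the parallel within-treatment regressions for the ATME: fit the
# same specification separately in the treated and control groups and take
# the difference in the moderator's coefficient. Data and names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "blame": rng.uniform(0, 1, 600),          # rescaled 0-1 blame score
    "treated": rng.integers(0, 2, 600),       # 1 = advice condition, 0 = control
    "algo_trust": rng.uniform(0, 1, 600),     # potential moderator S_i
    "age": rng.integers(18, 80, 600),         # example covariate in X_i
})

fit_t = smf.ols("blame ~ algo_trust + age", data=df[df.treated == 1]).fit()
fit_c = smf.ols("blame ~ algo_trust + age", data=df[df.treated == 0]).fit()

# ATME: difference in the moderator coefficient across treatment levels.
atme = fit_t.params["algo_trust"] - fit_c.params["algo_trust"]
se = np.sqrt(fit_t.bse["algo_trust"] ** 2 + fit_c.bse["algo_trust"] ** 2)
print(f"ATME = {atme:.3f}, approximate SE = {se:.3f}")
```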
We found no evidence of moderation of treatment effects based on standard respondent demographics (gender, age, race, and education). There does seem to be an ideological dimension to respondents' pattern of blame, with respondents who identify more strongly with the Republican Party placing significantly more blame on the judge when the judge agrees with the algorithm (ATME = 0.04, 95% confidence interval = [0.02, 0.06]), disagrees with the algorithm (ATME = 0.03, 95% confidence interval = [0.01, 0.05]), or disagrees with both the algorithm and human sources (ATME = 0.03, 95% confidence interval = [0.01, 0.05]). While greater trust in algorithms does somewhat reduce blame when the judge agrees with the algorithm and produces the most promising results (ATME = −0.110, 95% confidence interval = [−0.23, 0.004]), it has no discernible effect when the judge disagrees with the algorithm (ATME = −0.025, 95% confidence interval = [−0.14, 0.09]) and produces opposite-signed and insignificant results when the judge agrees with combined advice from a human and an algorithm (ATME = 0.03, 95% confidence interval = [−0.09, 0.05]) or disagrees with this combined advice (ATME = 0.05, 95% confidence interval = [−0.07, 0.17]). Footnote 66
Study 2
Given the surprising results from Study 1, we conducted a further study both to attempt a third replication of the results and to explore further why we obtained them. We pre-registered three additional hypotheses.
(1) Perceptions of blame are moderated by general trust in expert advice, Footnote 67 with those more trusting of experts showing decreased blame when the judge agrees with the algorithm and increased blame when the judge disagrees. Footnote 68 The basic idea behind this hypothesis is that algorithms, as suggested in Study 1, may be viewed similarly to other sources of expertise. For those with greater trust in expert advice generally, agreeing with an algorithm, even if it ends up being incorrect, will seem a natural and justifiable course of action. Conversely, for those who are skeptical of expert advice, the rejection of such expert advice, and reliance on one’s own intuition, will be more likely viewed as a reasonable course of action.
(2) Use of advice changes expectations of accuracy, with an increase in blame under the advice conditions being a result of greater expectations that the judge should have gotten the ruling correct. Given that previous studies have found people to have relatively high behavioral trust in algorithms, especially in these situations, Footnote 69 it is possible that the use of algorithms increases the expectations of a correct ruling. Failure to meet these expectations may result in greater blame for an incorrect ruling.
(3) Mistakes made when using advice are, retrospectively, seen as an abdication of responsibility, Footnote 70 with those who either think the judge did not use their own judgment or think the judge should have relied more on their own judgment increasing the blame placed on the judge. The treatment, and the subsequent incorrect ruling, may increase perceptions that the judge should have relied on their own judgment. Such retrospective biases, in which an obvious course of action appears in hindsight to have been ignored by a political decision-maker, are not unusual in other areas of politics, even when the actor has little to no control over the outcome. Footnote 71
The sample for this study was collected in March 2022 from Lucid with a sample size of 1,400 participants. The study was once again pre-registered with these hypotheses prior to being fielded. Footnote 72
Treatments
The experimental design was nearly identical to that of the previous study, with two notable exceptions. First, we included only three treatments from the prior study: (1) control, (2) judge agrees with the algorithm's recommendation, and (3) judge disagrees with the algorithm's recommendation. We did this both to focus on the most relevant treatments and to ensure appropriate statistical power for mediation analysis; conducting this analysis for all of the treatments from Study 1 would have increased costs well beyond our research budget. Footnote 73 Second, we included three additional measures, which we used to test for moderation and mediation effects. The first measure was an index of respondents' trust in experts, with respondents rating their degree of distrust for seven different types of experts. Footnote 74 The second measure assessed the post-treatment expectation that the judge should have made an accurate decision. We measured this by asking respondents, on a scale from “never” to “always,” how often they think the judge should have made the correct decision in the scenario. The third measure assessed the degree to which respondents believed that the judge was abdicating responsibility based on the treatment. This was measured using two post-treatment questions. The first assessed the degree to which the respondent thought the decision reflected the judge's evaluation versus that of the advice-giver. The second assessed the degree to which the respondent thought the decision should have reflected the evaluation of the judge or the advice-giver. Footnote 75
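As an illustration of how such an index might be scored (a sketch under our assumptions; the item labels, response scale, and reverse-coding below are hypothetical and not taken from the survey instrument):

```python
# Hypothetical scoring of a trust-in-experts index from seven distrust items
# rated 1-5: reverse-code so higher values mean more trust, average the items,
# and rescale to 0-1. Item names are invented placeholders.
import pandas as pd

items = ["doctors", "scientists", "economists", "lawyers",
         "professors", "civil_servants", "journalists"]
responses = pd.DataFrame([[2, 1, 3, 4, 2, 3, 5],
                          [5, 5, 4, 4, 5, 3, 4]], columns=items)

trust_index = ((6 - responses[items]).mean(axis=1) - 1) / 4   # 0 = least, 1 = most trusting
print(trust_index)
```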
Sample
Analysis is based on a demographically representative sample of the U.S. population gathered from Lucid. Of the 3,656 respondents who participated, 1,423 (40%) passed the attention check and were included in the study.
Analysis
Analysis for moderation based on trust in experts was done using the same within-treatment parallel-regression method discussed in Study 1 for estimation of the ATME. Analysis for causal mediation followed the protocol developed by Imai et al. and Imai, Keele, and Tingley, Footnote 76 and involved the estimation of two equations:

$${M_i} = {\alpha _1} + {\beta _1}{T_i} + {X_i}'{\xi _1} + {e_{i1}}$$

$${Y_i} = {\alpha _2} + {\beta _2}{T_i} + \gamma {M_i} + {X_i}'{\xi _2} + {e_{i2}}$$

${Y_i}$ is the outcome of interest, ${T_i}$ is the treatment, ${M_i}$ is the potential mediator, ${X_i}$ is a set of control variables, ${\alpha _1}$ and ${\alpha _2}$ are intercepts, ${\beta _1}$ and ${\beta _2}$ are treatment coefficients, $\gamma $ is the coefficient on the mediator, and ${\xi _1}$ and ${\xi _2}$ are vectors of coefficients on the controls. Since ${T_i}$ is randomly assigned, it is independent of the error terms, ${e_{i1}}, {e_{i2}}$ ⫫ ${T_i} \mid {X_i}$. The first equation estimates the effect of the treatment on the mediator. The second equation simultaneously tests the effect of the mediator and the treatment on the outcome. The average causal mediation effect (ACME) is estimated by calculating ${\beta _1}*\gamma $. Confidence intervals for this value were estimated using nonparametric bootstrapping. Footnote 77
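A minimal sketch of this two-equation estimator with a nonparametric bootstrap for the ACME is shown below; the simulated data, variable names, and model formulas are our own placeholders, not the authors' code.

```python
# Sketch of the two-equation mediation estimator with a nonparametric
# bootstrap for the ACME. Data and relationships are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 800
df = pd.DataFrame({"treated": rng.integers(0, 2, n),
                   "age": rng.integers(18, 80, n)})
df["mediator"] = 0.3 * df["treated"] + rng.normal(size=n)   # e.g., "judge should have decided alone"
df["blame"] = 0.2 * df["mediator"] + 0.1 * df["treated"] + rng.normal(size=n)

def acme(data):
    # Equation 1: effect of the treatment on the mediator (beta_1).
    b1 = smf.ols("mediator ~ treated + age", data=data).fit().params["treated"]
    # Equation 2: effect of the mediator on the outcome, holding treatment fixed (gamma).
    g = smf.ols("blame ~ mediator + treated + age", data=data).fit().params["mediator"]
    return b1 * g

boot = [acme(df.sample(frac=1, replace=True)) for _ in range(1000)]
print("ACME =", round(acme(df), 3),
      "95% CI =", np.round(np.percentile(boot, [2.5, 97.5]), 3))
```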
Results
In Figure 3 we replicate the analysis conducted above on the direct effect of the treatments. Interestingly, while the direction of the treatment effects remained the same, still contradicting the expectation from previous literature of increased blame when the judge disagrees with the algorithm and decreased blame when the judge agrees with the algorithm, the magnitude of the effects is smaller and does not reach standard levels of statistical significance (p > 0.05). While these results lack statistical significance, we note their general consistency with the previous results and that this lack of statistical significance does not necessarily prevent successful analysis of moderation or mediation. Footnote 78 Moreover, as noted above, such issues are not entirely surprising, given the relatively small effect sizes in the previous experiments. The results still provide relatively strong evidence against the hypotheses of the theoretical literature. Using the test developed by Gelman and Carlin, Footnote 79 the probability of Type S error, i.e., the probability that a statistically significant estimate would have the opposite sign of the true effect, is 1.1% when the judge agrees with the algorithm and 0.8% when the judge disagrees with the algorithm. The results still refute the concerns laid out by previous scholars, although with weaker evidence, and there is no evidence of a statistically significant difference between agreeing and disagreeing with the algorithm (p > 0.1).
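For readers unfamiliar with the calculation, the sketch below outlines the Gelman and Carlin design analysis; the effect size and standard error plugged in are illustrative numbers, not the estimates from this study.

```python
# Sketch of a Gelman & Carlin (2014) design analysis: power and the Type S
# error rate (probability that a statistically significant estimate has the
# wrong sign) for a hypothesized true effect and standard error.
from scipy.stats import norm

def retrodesign(true_effect, se, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)              # critical value (about 1.96)
    lam = true_effect / se                   # standardized true effect
    p_hi = 1 - norm.cdf(z - lam)             # P(estimate significantly positive)
    p_lo = norm.cdf(-z - lam)                # P(estimate significantly negative)
    power = p_hi + p_lo
    type_s = (p_lo if true_effect > 0 else p_hi) / power
    return {"power": round(power, 3), "type_s": round(type_s, 4)}

# Illustrative only: a true effect of 0.05 on the 0-1 blame scale with SE 0.02.
print(retrodesign(true_effect=0.05, se=0.02))
```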
Figure 4 shows evidence that trust in experts moderates the response to a judge disagreeing with an algorithm. Figure 4a–c show the regression line for trust in experts in each treatment condition. In the control condition, the line is nearly flat: trust in experts does not affect blame when the judge is making the decision on their own. When the judge agrees with the algorithm, there is a slight, but insignificant, decrease in blame. Finally, when the judge disagrees with the algorithm, there is a significant (p < 0.001) and positive relationship between blame and the amount of trust the respondent places in experts. Comparing Figures 4a and 4c, it is notable that the blame placed on the judge exceeds that of the control condition only at the highest levels of trust in experts, encompassing a minority of our sample. For those less trusting of expert advice, the rejection of advice from an expert appears to be seen as justified. Figure 4d summarizes these results. Respondents with the highest level of trust in experts place about 33% more blame on the judge than those with the least trust in experts when the judge disagrees with the algorithm. Conversely, they place about 17% less blame on the judge when the judge agrees with the algorithm. Footnote 80
We find little evidence that changes in expectations mediate the amount of blame. Figure 5 shows these results. Figure 5b and d show that individuals with higher accuracy expectations do place significantly more blame on the judge for their decision (p < 0.001). However, Figure 5a shows no significant relationship between agreeing with the algorithm and the expectation that the judge should have arrived at a correct decision, and Figure 5c shows that the relationship between disagreeing with the algorithm and the expectation of accuracy is negative. Footnote 81 Expectations may be an important explanation of blame generally, but they do not appear to link the use of advice to greater blame.
There is some evidence for Hypothesis 3: perceived reliance on advice is viewed as an abdication of the judge's responsibility, and this increases blame (i.e., the judge should have figured it out on their own). Both post-hoc evaluations of the relative role of the judge and the algorithm and assessments of which should have had greater weight have significant average causal mediation effects (ACMEs) when the judge agrees with the algorithm, but not when the judge disagrees with the algorithm. Footnote 82 Figure 6a shows that there is a significant relationship (p < 0.001) between the judge agreeing with the algorithm and respondents suggesting that the judge should have been more hands-on in making the judgment. Similarly, Figure 6b shows a significant relationship (p < 0.001) between this attitude that the judge should have more control over the decision and the amount of blame placed on the judge for the decision. Figure 6d summarizes this relationship, showing that about 66% of the effect of agreeing with the algorithm is mediated by respondents saying that the judge should have relied more on their own judgment in these situations. This is also borne out in the comments left by respondents, who regularly emphasized the importance of the judge exercising their own judgment (e.g., “they hold the office”). Footnote 83 There is certainly some level of retrospective bias in this result, but such evaluations are not uncommon in assessing the performance of public officials, even for circumstances beyond their control. Footnote 84 There are, however, also some reasons for caution about the mediation results. Mediation analysis, in general, is criticized by some scholars for being vulnerable to confounding. Footnote 85 There are also some sample-specific issues and indications that the observed ACMEs are not especially robust. Footnote 86 Nevertheless, this does suggest an interesting path for future inquiry, and, at a minimum, provides significant evidence that the use of algorithms risks perceptions of dependence and re-assessments of the role algorithms should play in decision-making when inevitable mistakes are made. Indeed, some of this could account for the quick turn of public sentiment against privately developed algorithms like COMPAS and PredPol in recent years, as problems of accuracy and bias have become more apparent. A number of jurisdictions have dropped their contracts with the associated companies in recent years or have decided not to sign new contracts under increased public scrutiny. Footnote 87
Discussion
These results have profound implications for policymakers as the use of algorithms in public decision-making grows. First, while more work is needed to tease out the extent to which the results above hold in different circumstances, there is a clear indication that public decision-makers will be held accountable for their decisions and any adverse consequences of those decisions. We find no significant evidence that using algorithms for advice decreases blame on decision-makers when they agree with the advice or uniquely increases blame when they disagree with it. Nor do we find that algorithms hold any special place as a source of advice relative to human sources. This is consistent with other work on experts in the policy implementation process, which finds that the public believes experts should assist in how a decision is implemented, not in what decisions are made. Footnote 88 In addition, these results are consistent with studies suggesting the public is increasingly resistant to expertise as a justification for policy action. Footnote 89 If a policymaker is contemplating the use of an algorithm to assist in decision-making, that decision will need to be made on the basis of efficacy, not on whether the algorithm will shield them from criticism or resolve the anxiety of incorrectly assessing risk. From the perspective of ethicists and legal scholars, the results may be something of a double-edged sword. While we found none of the issues with democratic accountability about which some worry, the results also suggest that algorithms, usually implemented under the label of “evidence-based practices,” will not provide a shortcut for addressing the mass incarceration problem in the U.S.
Second, to the extent that a policymaker believes an algorithm will result in better decision-making, the logic underlying the algorithm's decision process needs to be explainable. Legitimacy in the government sphere, and especially in the legal area, requires justification for actions. Footnote 90 This addresses the concern raised in Study 2 about respondents viewing concurrence with the algorithm as an abdication of the judge's responsibility to use their own judgment. Judges must be able to explain why the algorithm influenced their decision, beyond simply parroting the final analysis of the algorithm, as was infamously done by the judge in the Loomis v. Wisconsin case. This can be difficult with modern computerized decision aids, which can rely on complicated machine learning architectures that are hard to explain in natural language. Footnote 91 Being able to explain the complicated mathematics of a statistical model is, however, not really what is being demanded of these decision aids. Using the tools of behavioral and counterfactual analysis, decision-makers can ask how specific factors would change the results produced by the algorithm. Even if policymakers cannot explain the complex weighting of the machine learning model used to generate the forecasts, this provides direct insight into the impact of sensitive factors like race and income on recommendations. Footnote 92 This allows decision-makers to justify the basis for the decision and address accusations of unfairness. Auditing tools leverage this fact to identify potential ethical issues with algorithms, Footnote 93 and such tools may also prove useful in elucidating the decision process and providing the required explanations and justifications for policy decisions. State-of-the-art documentation and auditing is, however, still relatively rare. Legislation currently being considered in Congress (H.R. 6580; S. 3572), which would require auditing and impact assessments for some entities, may assist in this process.
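As an illustration of the kind of counterfactual query described above (a sketch with a toy model; the features, data, and risk tool are hypothetical, not any vendor's system):

```python
# Sketch of a counterfactual audit query: hold a defendant's record fixed,
# vary one sensitive attribute, and compare the tool's risk scores. The model
# and features are invented stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # columns: prior_offenses, age, income (standardized)
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)

defendant = np.array([[1.2, -0.5, -1.0]])       # observed record
counterfactual = defendant.copy()
counterfactual[0, 2] = 1.0                      # same record, but with higher income

print("risk (observed):      ", model.predict_proba(defendant)[0, 1].round(3))
print("risk (counterfactual):", model.predict_proba(counterfactual)[0, 1].round(3))
```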
Finally, this research offers fertile ground for further work exploring the perceived role and implementation of algorithms in public policy and politics, as well as how these are framed for the public and the effects of this framing. More specifically, we believe it is vital to continue exploring how the implementation of computer algorithms may exacerbate distrust among both the public and elites, affecting potential policy implementation and subsequent success. Moreover, we believe our results highlight the need to further explore the role of algorithms in a context similar to that of human experts, as the type of role the algorithm is designed to serve may affect the degree of support and trust from the public. Other contexts may also open the possibility of exploring more complete counterfactuals. In this study, the only counterfactual being evaluated is when the judge releases someone who goes on to commit a new crime. We could not evaluate whether the results would be similar if someone were jailed unnecessarily, since whether they would have committed a crime if released cannot be observed. Yet, there is some evidence that the public does not evaluate false positives and false negatives in the same way, Footnote 94 and the implementation of these algorithms is inherently tied to these calculations of relative risk. Footnote 95 Other scenarios may help provide a more complete picture of how different types of errors affect blame on public officials. Finally, we note that, while the authors of this study attempted to produce experimental treatments that were relatively minimal while maintaining the realism of the scenarios, consistency with theory, and clarity of the treatment, there were differences in how the treatment conditions and the control condition were worded in order to ensure fidelity to theory and clarity of the treatment. This raises the possibility that responses to advice conditions might be highly sensitive to framing effects. Footnote 96 While our results still suggest that theorists' concerns that algorithms, in themselves, have unique and significant influence are likely overblown at present, future scholars may find a productive area of research in exploring how the framing of expert and/or algorithmic advice affects public perceptions, or even in exploring how public perceptions would be shaped by more intensive interventions like public deliberation. Footnote 97 Our study provides a baseline on which these studies can proceed.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/bap.2023.35