1 Introduction
The naturalistic movement in human decision making was a response to what its originators believed were significant limitations of classic behavioral decision theory. Among these was a putative under-emphasis on processes and events that occur at the outset, in the course of recognizing decisional situations and identifying feasible alternatives (Beach, 1997; Klein, Orasanu, Calderwood & Zsambok, 1993). Early efforts to address this limitation included the schema-based image theory (Beach, 1998), which proposed that decision making occurs in two phases (Beach & Potter, 1992; Van Zee, Paluchowski & Beach, 1992). At the first phase, decision makers narrow the range of alternatives by applying a simple, non-compensatory test of compatibility with their ethical standards, values, beliefs, goals, and plans (Beach & Strom, 1989; Richmond, Bissell & Beach, 1998). This test, originally called the “simple counting rule”, is conducted by tallying incompatible attributes and rejecting alternatives whose incompatibility count exceeds a threshold (Beach, Smith, Lundell & Mitchell, 1988).
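The counting rule can be sketched in a few lines of Python. This is an illustration only: the function name, the attribute names, and the threshold values are hypothetical and are not taken from the image theory literature.

```python
def passes_compatibility_test(attribute_is_compatible, rejection_threshold):
    """Simple counting rule: tally incompatible attributes and reject the
    alternative when the tally exceeds the rejection threshold."""
    violations = sum(1 for ok in attribute_is_compatible.values() if not ok)
    return violations <= rejection_threshold


# Hypothetical alternative with one incompatible attribute out of three.
candidate = {"values": True, "goals": False, "plans": True}

print(passes_compatibility_test(candidate, rejection_threshold=1))  # True
print(passes_compatibility_test(candidate, rejection_threshold=0))  # False
```

The rule is non-compensatory in the sense that compatible attributes never offset violations; only the violation tally is compared with the threshold.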
The two-phase conception and the compatibility test have been applied to a variety of situations. However, most research has focused on common personal and organizational tasks, such as purchasing consumer goods, making career choices, allocating resources to product development, and selecting job candidates. The current paper brings compatibility testing to the arena of clinical decision making. It describes how clinicians make treatment decisions that incorporate distinct kinds of information, including general knowledge about treatment effectiveness that is based on clinical and epidemiological studies, and patient-specific information that is obtained through clinical examination. Clinical decision tasks have been simplified somewhat by the advent and widespread use of treatment guidelines, which identify evidence-based interventions, specify supporting evidence, and include algorithms and assessment procedures (Lomas et al., 1989; S. H. Woolf, 1990, 1993). The role and function of guidelines in clinical judgment has also fueled controversy and intensified an ongoing conflict between practitioners and service researchers (Dickenson & Vineis, 2002; Satterfield et al., 2009). The latter tend to regard guideline recommendations as standards of care, and they are inclined to gauge performance by comparing clinician decisions against guideline recommendations (Drake et al., 2001; Grimshaw & Eccles, 2004). This tack is vehemently opposed by practitioners and others, who insist that the purpose of guidelines is to assist in the expert task of “contextualizing” general recommendations by incorporating patient-specific information (Maier, 2006; Ruscio & Holohan, 2006; A. D. Woolf, 1997).
By now, the parties to the so-called “evidence debate” (McQueen, 2002) about guidelines as decision aids versus standards of care have marshaled such impressive support that discussion has reached a stalemate. On the one hand, guideline recommendations are based on evidence of treatment effectiveness; endorsing them overcomes well-documented tendencies of clinicians to get lost in complications, take refuge in lore, and make decisions that are inconsistent both temporally and across geographical boundaries (Eddy, 1990; Moffic, 2006; Weisz et al., 2007). On the other hand, practitioners are mandated to treat patients, not diseases. Guidelines give categorical recommendations; they cannot be expected to factor in patient-specific circumstances, account for heterogeneous conditions and treatment responses, or incorporate patient values and preferences.
Of all of the solutions to this predicament that have been proposed to date, Eddy’s (2005) is perhaps the most feasible. In his vision of “evidence-based decision making” (EBDM), a guideline serves as a point of reference. Its recommendations should be followed in most cases, but clinicians are expected to adapt them in light of patient-specific needs and idiosyncratic circumstances. The EBDM vision strikes a balance between evidence and application, but Eddy’s proposal lacks an alternative to what can be called “conformance testing,” or comparing treatment recommendations against a strict standard. In previous work, we proposed an alternative inspired by image theory’s compatibility test, and used the counting rule to examine how clinicians systematically factor patient-specific information into guideline recommendations (Falzer & Garman, 2009, 2010). Our study used a treatment guideline developed at the Yale Department of Psychiatry for patients with schizophrenia (Sernyak, Dausey, Desai & Rosenheck, 2003). This is a progressive, five-step algorithm that recommends treatment switches for patients who have not responded adequately to full trials of antipsychotic medication. The guideline was derived from a widely disseminated set of recommendations for treating patients with schizophrenia (Lehman et al., 1998), and developed specifically to favor less expensive generic treatments over newer (second generation or “atypical”) antipsychotic agents. The five steps represent progressive orders of unit cost (Rosenheck, Leslie & Doshi, 2008).
The guideline calls for treatment to begin with a course of first-generation antipsychotic therapy such as haloperidol or fluphenazine. At step two, patients who do not respond adequately after three months are switched to a different first-generation treatment. At step three, non-responders are switched to a different medication class, a second-generation treatment such as risperidone or olanzapine. Step four is a trial of a different second-generation treatment. Clozapine therapy is introduced at the fifth and final step. Switch recommendations are guided by two items from the Clinical Global Impression Scale (Guy, 1976): A severity scale measures the patient’s illness, and a progress scale measures the treatment response. Each scale is rated from 0 to 7, with higher scores indicating greater severity of illness or a poorer response. A switch is recommended if the illness score is 4 or higher, indicating at least moderate illness, and the progress score is 3 or higher, indicating minimal improvement at best.
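The switch criterion stated above reduces to a conjunction of two cutoffs. A minimal Python sketch follows; the function and constant names are illustrative and do not come from the guideline itself.

```python
ILLNESS_CUTOFF = 4   # illness score of 4 or higher: at least moderate illness
PROGRESS_CUTOFF = 3  # progress score of 3 or higher: minimal improvement at best


def switch_recommended(illness_score, progress_score):
    """Return True when both criteria for a treatment switch are met."""
    return illness_score >= ILLNESS_CUTOFF and progress_score >= PROGRESS_CUTOFF


print(switch_recommended(4, 3))  # True: both scores sit exactly at the cutoffs
print(switch_recommended(6, 6))  # True: severe illness, much worse
print(switch_recommended(3, 6))  # False: illness score is below the cutoff
```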
Our previous work (Falzer & Garman, 2010) used case vignettes to systematically vary four factors and asked psychiatric residents to make treatment recommendations. Subjects were introduced to the five-step algorithm and switch criteria. Vignettes were constructed from a design in which each factor had either a compatible level or a discrepant level. “Discrepant” levels are technically consistent with the guideline, but they are either inconsistent with clinical practice or introduce an additional relevant factor. Discrepant information weakens the guideline recommendation and may lead clinicians to consider an alternative. The factors and their levels were:
1. For the progress factor: “The progress score is 3, minimally improved over the past 6 months,” versus “the progress score is 6, much worse over the past 6 months.” Minimal improvement is discrepant because it barely meets the switch criteria and indicates that the current treatment may be beneficial. “Much worse” is compatible because it clearly meets the switch criteria and strongly indicates the need for a treatment change.
2. For the illness factor: “The illness score is 4, moderate illness at present,” versus “The illness score is 6, severe illness at present.” Moderate illness is discrepant because it barely meets the switch criteria, whereas “severely ill” clearly meets the criteria and strongly indicates the need for a treatment change.
3. For the adherence factor: Subjects were presented with a 4-point adherence scale. The lowest rating was “1: never/almost never takes medications as prescribed (0–25% of the time).” The highest rating was “4: always/almost always takes medications as prescribed (75–100% of the time).” The lowest score is discrepant because non-adherent patients are unlikely to benefit from a treatment change. The highest score is compatible because patients who take medication as prescribed are most likely to achieve its full benefit.
4. For the likelihood factor, subjects were presented with one of four sets of likelihoods. Three sets were discrepant because they indicated a low likelihood of a positive response to the guideline-recommended treatment. For one of the discrepant sets: “With this subset of schizophrenic patients, [following the switch recommendation] will have the following results: 10% chance of significant improvement, no longer treatment resistant; 40% chance of no significant change; 50% chance of getting significantly worse, requiring hospitalization.” (There were two other low discrepant sets: 10%-80%-10%, which suggests that a switch would probably be ineffective; and 45%-10%-45%, which suggests that a switch is risky.) The fourth set was compatible owing to a high likelihood of a positive response. The text was the same as noted above, but percentages were 50%-40%-10%, suggesting a high likelihood of significant improvement and a low likelihood of decompensation.
Note that the adherence and likelihood factors were not included in the guideline, but introduce information that clinicians would regard as relevant to making a treatment recommendation.
Each subject in the earlier study evaluated 64 vignettes. The vignettes were constructed from a 2 × 2 × 2 × 4 × 2 design. The first four variables represent the four factors described above. The last 2-level factor refers to the guideline step, 2 or 4. Overall, 42% of their recommendations concurred with the guideline. However, the endorsement rate ranged from 32% to 91%, depending on the number of discrepancies. There was a significant inverse linear relationship between the likelihood of endorsing the guideline recommendation and the number of discrepant attributes, and the discrepancy count explained 65% of the within-subject variance (Falzer & Garman, 2010).
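The size of the design and the spread of discrepancy counts can be checked by enumerating the factor levels. The sketch below is illustrative: the level labels are shorthand, and counting one discrepancy per discrepant factor level is an assumption about how the tally was formed, not a description of the original analysis.

```python
from collections import Counter
from itertools import product

# Shorthand labels for the levels described in the text.
progress = ["discrepant", "compatible"]
illness = ["discrepant", "compatible"]
adherence = ["discrepant", "compatible"]
likelihood = ["discrepant_a", "discrepant_b", "discrepant_c", "compatible"]
guideline_step = [2, 4]

vignettes = list(product(progress, illness, adherence, likelihood, guideline_step))


def discrepancy_count(vignette):
    # Only the first four factors can be discrepant; the step is excluded.
    return sum(level.startswith("discrepant") for level in vignette[:4])


counts = Counter(discrepancy_count(v) for v in vignettes)
print(len(vignettes))          # 64
print(sorted(counts.items()))  # [(0, 2), (1, 12), (2, 24), (3, 20), (4, 6)]
```

Under this assumption the design is unbalanced with respect to the discrepancy count, because three of the four likelihood sets are discrepant.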
Patients ask a variety of questions in the routine course of consultation, but two questions tend to predominate. They are: “what do I have?” and “what are my chances?” The first question requests a diagnostic classification; the second asks for a specific or tailored forecast. Most patients will not be satisfied with a general likelihood estimate that applies to a disease population, followed by a caveat to the effect that “all patients are different.” From the perspective of EBDM, forecasting requires expertise in using a guideline to modify a general estimate in light of relevant patient-specific factors (Visweswaran et al., 2010).
The current study uses the counting rule to examine how clinicians bring a combination of general and case-specific knowledge to the task of forecasting a patient’s treatment response. The study focuses on two principal findings from image theory research: One is that probabilities are treated as attributes at the initial phase of decision making (Potter & Beach, 1994b; Van Zee et al., 1992). The other finding is that, in some situations, certain attributes have greater weight than others (Beach, Puto, Heckler, Naylor & Marble, 1996). We expect that clinicians practicing EBDM will give greater weight to general likelihoods than to other clinical factors when making a patient-specific forecast.
In the current study, subjects made two forecasts for each of 8 case vignettes. In one forecast, they projected the likelihood of a positive treatment response if the guideline recommendation is followed from step 3 forward; the other forecast projected the likelihood of a positive treatment response if the guideline recommendation is not followed, i.e., if the guideline recommends change and no change is made, or if the guideline recommends change to one treatment and a change is made to a different treatment. This procedure allowed us to examine how the absence of general likelihood data affects the counting rule.
In asking subjects to make these judgments, we do not provide likelihood information for alternatives to the guideline recommendation. General likelihoods may be absent for a variety of reasons, but most commonly because alternatives to a guideline recommendation have not been extensively investigated. Image theory studies have found that decision makers treat missing information as a violation (Potter & Beach, 1994a). In other words, a significant missing piece of information is treated as incompatible information. However, clinicians who are familiar with treatment alternatives and are accustomed to comparing guideline-recommended treatments with commonly-used alternatives may handle missing information differently, perhaps by adjusting the likelihoods of known treatments.
Asking subjects to make two forecasts also allows us to identify the higher forecast as the preferred alternative and examine how preference influences the counting rule. Forecasts may be biased in a variety of ways (Alexander, 2008; Harvey, 2007; Wolfson, Doctor & Burns, 2000). A “value induced” or preference bias is frequently mentioned in the clinical decision making literature to explain the ostensible tendency of clinicians and patients to make over-optimistic forecasts about favored alternatives (Gurmankin Levy & Hershey, 2006; also see Krizan & Windschitl, 2007; Levy & Hershey, 2008). A study by Ditto and Lopez (1992) found that preference bias is apparent in decisional processes as well as summary forecasts. Specifically, judgments are reached more quickly and require less information when they are consistent with favored conclusions. This finding suggests that preference may influence what image theory calls the “rejection threshold,” the point at which prospective alternatives are rejected because of too many discrepant attributes. A key finding of image theory research is that once the threshold is met, additional discrepancies have limited influence on whether to eliminate a prospective course of action from further consideration (Beach & Strom, 1989). The influence of the rejection threshold on likelihood forecasts can be seen by plotting mean likelihoods at each violation count. Likelihoods should decrease somewhat as the number of violations increases, then drop precipitously and flatten out. The current study examines the influence of preference on the rejection threshold by proposing that favored alternatives have a higher (that is, a more generous) threshold than non-favored alternatives.
2 Method
2.1 Guideline and task
Subjects evaluated eight case vignettes that were selected from the group of 64 that the same subjects had reviewed in performing the treatment recommendation task described above (Falzer & Garman, 2010). The vignettes were rated in the manner described below at step three of the guideline, after the hypothetical patient had failed to respond adequately to two courses of a first-generation antipsychotic treatment. These patients comprise the roughly 15 to 25% of patients with schizophrenia who are “treatment resistant” (Brenner et al., 1990; Falzer, Garman & Moore, 2009). Ratings consisted of two forecasts of a favorable treatment response: a) if treatment followed the guideline recommendations from step three forward, and b) if treatment departed from the guideline recommendation. The forecasts were made sequentially, using a 0 to 100 scale. As with the previous study, subjects were able to consult the guideline as they performed the task and were instructed to proceed through the vignettes in the order they were presented.
As experienced psychiatric trainees, the subjects are well aware of at least three viable alternatives to following the guideline from step three forward. One alternative, especially with a partial response, is to continue the current treatment. The second is to recommend clozapine earlier than the guideline-recommended step five. There is extensive support for using clozapine for treatment-resistant patients (Falzer & Garman, 2012; Kane, 2004), but despite its effectiveness, clozapine tends to be underused (Fayek, Flowers, Signorelli & Simpson, 2003; Mistry & Osborn, 2011; Nielsen, Dahm, Lublin & Taylor, 2010). The third alternative is embodied in the American Psychiatric Association’s (APA’s) recommendation to introduce depot (long-acting, intra-muscular injected) medications for patients who have demonstrated poor adherence to orally administered treatments (Lehman, Lieberman, et al., 2004). Treatment with an injectable medication may begin with a first-generation formulation, even if the oral formulation was previously tried.
2.2 Study design
The eight vignettes were sorted into four random orders and then presented to the subjects at random. A fully balanced 2 × 2 × 2 design was created by manipulating three factors: general likelihood of a positive treatment response, course of the illness, and patient-specific adherence. Each factor had two levels. One level represented compatibility between the case and the guideline recommendation; the other level represented incompatibility and is treated as a violation. The factors and levels are as follows:
1. General likelihood of a positive treatment response: For the compatible level, a 50% likelihood of a positive response, a 40% likelihood of no change, and a 10% likelihood of a negative response. For the violation level, a 10% likelihood of a positive response, a 40% likelihood of no change, and a 50% likelihood of a negative response. These likelihoods may seem low, but they are consistent with current findings about the limited effectiveness of antipsychotic medication for patients with treatment-resistant schizophrenia, and the risk inherent to switching from one treatment to another.
2. Course of the illness: As described in the previous section, the treatment guideline uses two items from the Clinical Global Impression Scale (CGI) to assess patients’ current condition and their progress during the current treatment. In all 8 vignettes, the progress item score was “4, no change,” which calls for a switch to step 3. For the compatible level, the severity item score was “6, severely ill.” For the violation level, the severity item score was “4, moderately ill.” To a layperson these scores may seem backwards, but for a trained psychiatrist a severe condition combined with lack of progress is attributed to the current treatment’s lack of effectiveness. Consequently, a severe illness and no progress is compatible with the guideline’s switch recommendation. Moderate illness combined with no progress suggests that the patient’s condition is on a stable or deteriorating course. For these patients, clozapine is the treatment of choice; alternatively, the current treatment would be continued.
3. Patient-specific adherence: As described in the previous section, adherence was high for the compatible level (75% or greater) and low for the violation level (25% or lower). There is ample evidence that low adherence reduces the likelihood of a positive treatment response (Ascher-Svanum et al., 2006). As noted above, the APA guideline recommends a depot medication when adherence is low. What cannot be determined from the vignette, as in actual practice, is how the patient’s adherence is affected by the treatment regimen, and consequently whether adherence would change with a different treatment.
2.3 Hypothesis testing and data analysis
The study tested three hypotheses. The first is that clinicians use an unequally-weighted counting rule in making patient-specific forecasts that follow the guideline. The hypothesis is examined by treating the first forecast (the likelihood of a positive response if the guideline is strictly followed) as the dependent variable, and two categorical “violations” variables as independent variables. One variable ranges from 0 to 3 and represents an equally-weighted violation count. It is computed by a simple sum of the violations in each vignette. The other variable ranges from 0 to 4 and represents an unequally-weighted violation count. It is computed by giving the general likelihood factor twice the weight of the course and adherence factors.
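The two violation variables can be illustrated by enumerating the eight cells of the 2 × 2 × 2 design. The sketch below is a minimal illustration; the dictionary keys and function name are hypothetical, and only the weights (1-1-1 versus 2-1-1) come from the description above.

```python
from itertools import product

FACTORS = ("likelihood", "course", "adherence")
WEIGHTS_EQUAL = {"likelihood": 1, "course": 1, "adherence": 1}
WEIGHTS_UNEQUAL = {"likelihood": 2, "course": 1, "adherence": 1}


def violation_count(violations, weights):
    """Sum the weights of the factors that are at their violation level."""
    return sum(weights[name] for name, violated in violations.items() if violated)


# One row per vignette: True marks a factor at its violation level.
for levels in product([False, True], repeat=len(FACTORS)):
    violations = dict(zip(FACTORS, levels))
    equal = violation_count(violations, WEIGHTS_EQUAL)
    unequal = violation_count(violations, WEIGHTS_UNEQUAL)
    print(violations, equal, unequal)
```

The two counts coincide whenever the general likelihood is at its compatible level and differ by exactly one otherwise, which is why the two models are expected to produce similar but not identical fits.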
The hypothesis is tested by creating two linear mixed effects models and examining each independent variable separately. For the hypothesis to be confirmed, the unequally-weighted variable must be significantly associated with the patient-specific forecast. If the equally-weighted variable is also significant, the two models will be compared using Akaike’s Information Criterion (AIC) index, a “lower is better” goodness-of-fit measure (Akaike, 1974). In addition, model means will be inspected for evidence of a rejection threshold. Hypothesis testing uses the linear mixed model algorithm in SPSS 19 (SPSS Inc., 2011), with a diagonal covariance type. A numeric subject identifier is treated as a subject-level factor; trial number (1–8) is a repeated measure.
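For readers unfamiliar with the index, the textbook definition is AIC = 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood. The Python sketch below shows how two models might be compared as a relative reduction. The log-likelihoods and parameter counts are made-up numbers for illustration, not values from this study, and the way SPSS counts parameters for mixed models is not represented.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def percent_reduction(aic_reference, aic_candidate):
    """Relative improvement of the candidate model over the reference."""
    return 100.0 * (aic_reference - aic_candidate) / aic_reference


# Hypothetical fits, for illustration only.
aic_equal = aic(log_likelihood=-700.0, n_params=5)
aic_unequal = aic(log_likelihood=-690.0, n_params=5)

print(aic_equal, aic_unequal)
print(percent_reduction(aic_equal, aic_unequal))
```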
The second hypothesis is that clinicians use an unequally-weighted counting rule in making patient-specific forecasts that do not follow the guideline. The analytic procedure is the same as for hypothesis one, with the second forecast as the dependent variable. Three mixed models are compared: The first model treats the absence of general likelihood information as a violation. The second model substitutes the violation level of the guideline’s general likelihood. The third model substitutes the general likelihood as a weighted violation.
The third hypothesis examines how clinicians’ preferences that either favor or oppose the guideline influence their use of the counting rule. The hypothesis that preference significantly influences the counting rule is tested by creating a preference variable, then examining the preference by violations interaction. A significant preference by violations interaction for each rating confirms the hypothesis. The preference variable is created by subtracting the second forecast from the first for all eight vignettes. A positive difference indicates that for a given vignette, a subject favors the guideline recommendation. A negative difference indicates that the subject favors an alternative to the guideline recommendation. A difference of 0 indicates no preference. Violation factors that pertain to both ratings will be tested, provided that hypotheses 1 and 2 are confirmed. Tests of the first two hypotheses will also determine whether an equally- or unequally-weighted rule is used. If only one of the hypotheses is confirmed, then hypothesis 3 will be tested for that rating only. If neither hypothesis is confirmed, then hypothesis 3 will not be tested.
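The construction of the preference variable can be sketched as follows. The function name, the category labels, and the example forecasts are hypothetical; only the subtraction and the sign convention come from the description above.

```python
def preference(first_forecast, second_forecast):
    """Classify a vignette by the sign of (first forecast - second forecast)."""
    difference = first_forecast - second_forecast
    if difference > 0:
        return "guideline"    # guideline recommendation is favored
    if difference < 0:
        return "alternative"  # an alternative to the guideline is favored
    return "none"             # identical forecasts: no preference


# Hypothetical pairs of forecasts on the 0 to 100 scale.
print(preference(60, 40))  # guideline
print(preference(30, 45))  # alternative
print(preference(50, 50))  # none
```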
2.4 Subjects
Twenty-one volunteer psychiatric residents with experience in treating patients with schizophrenia were recruited as subjects. They were paid $100.00 to complete a one-hour session that included the task described here. The funding source and two local Human Investigation Committees required that recruitment be done passively, to minimize concerns that residents’ participation could affect their status or progress in the training program. Consequently, only candidates who were interested in participating contacted the study investigator, and every candidate who contacted the investigator became a subject. The experience requirement limited the sampling frame to third and fourth year residents, and fellows with experience treating patients with schizophrenia. These criteria were verified with the candidates prior to obtaining informed consent. The residency program has no specific training in clinical decision making or in using treatment guidelines. The guideline used in this study had not been incorporated into routine clinical procedures and none of the subjects was acquainted with it.
3 Results
Of the 21 subjects, 11 were third year residents, 5 were fourth year residents, and 5 were fellows. The demographic characteristics correspond roughly to the population of the training program, with 14 males and 7 females and a mean age of 33.4 years (±3.6). Fourteen listed their race as Caucasian, 6 as Asian, and 1 as other. One male Caucasian resident identified himself as Hispanic. Mean ratings of the two forecasts were almost identical: For the first forecast, the mean was 39.8 (±22.3), with ratings ranging from 3 to 90. For the second forecast, the mean was 36.4 (±19.7), with ratings ranging from 4 to 90. Subject age, gender, and race had no significant effect on either forecast.
Based on a comparison between the two patient-specific forecasts, the first rating was higher than the second in 66 of the 168 total vignette presentations (21 × 8), or 39.3%. An alternative was favored in 67 presentations, or 39.9%. The two ratings were identical, indicating no preference, in 35 presentations, or 20.8%. The mean difference between the ratings was 20.3 (±11.56) when the guideline was favored and –12.0 (±9.07) when an alternative was favored. Seven subjects expressed a single preference in all eight vignettes. Of these seven, one subject always favored the guideline, one always had no preference, and five always favored an alternative. Ten subjects expressed multiple preferences in at least two of the eight vignettes; three of these ten subjects expressed all three preferences. Using a multinomial GEE analysis (Hardin & Hilbe, 2003), preference was predicted by year in residency (Wald χ² = 10.597, df=2, p=.005). Third year residents were less likely to favor the guideline recommendation and fellows were more likely to favor the guideline recommendation. Consequently, resident year was entered, along with the subject identifier, as a subject-level factor in the mixed model analyses.
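The preference tallies reported above can be verified with a short arithmetic check. The variable names are illustrative; the counts are those given in the text.

```python
presentations = 21 * 8  # 21 subjects, 8 vignettes each
tallies = {"guideline": 66, "alternative": 67, "none": 35}

# The three categories should account for every presentation.
total = sum(tallies.values())

shares = {name: round(100 * n / presentations, 1) for name, n in tallies.items()}
print(total == presentations)  # True
print(shares)  # {'guideline': 39.3, 'alternative': 39.9, 'none': 20.8}
```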
3.1 First hypothesis
The first hypothesis is that subjects use an unequally-weighted counting rule when they make patient-specific forecasts that follow the guideline. Mixed model analyses of the equally- and unequally-weighted counting rules are in Table 1. Both analyses show a significant inverse linear relationship between mean estimates and the number of violations, and there is evidence of a rejection threshold at 2 violations. In the equally-weighted model, the forecast drops precipitously from a mean of 46 at 1 violation to 35 at 2. In the unequally-weighted model, the forecast drops from a mean of 53 at 1 violation to 28 at 2. Bonferroni-corrected paired comparisons reported in Table 1 confirm that the drop between 1 violation and 2 is statistically significant for both models, and that the differences between 0 and 1 and between 2 and 3 are non-significant. The unequally-weighted model has the same pattern of paired comparisons, but features a steeper drop at the rejection threshold and a slightly better goodness of fit, as indicated by a 1.5% reduction in the AIC index. These findings indicate that study subjects employed a compatibility test in reaching patient-specific forecasts of a positive treatment response, and lend limited support to the hypothesis that general likelihood is weighted more heavily than course and adherence.
3.2 Second hypothesis
The second hypothesis is that subjects use an unequally-weighted counting rule in making patient-specific forecasts that do not follow the guideline. Tests of the second hypothesis are displayed in Table 2. Results indicate that absence of general likelihood information was not treated as a violation. The estimated means give no indication of a threshold, and neither the violations factor nor the linear contrast tests are significant. (Only the simple counting rule is tested because doubling the value of a constant does not change the results.) The alternative explanation, that subjects substitute guideline likelihoods in making forecasts, is supported by significant violations factor and linear contrast tests. These tests were significant in both the equal-weighted and unequal-weighted models. The pattern of means and rejection thresholds is similar to what was found with the first rating, and the AIC index for the unequally-weighted model is 1.5% lower. These findings indicate that study subjects employed a compatibility test in reaching patient-specific forecasts of a positive treatment response. As with the first ratings, there is limited support for the hypothesis that the general likelihood is weighted more heavily than course and adherence. Further, the fact that the two sets of findings were almost identical suggests that the two forecasts were not made independently.
3.3 Third hypothesis
Because an unequally-weighted violations variable provided a slightly better fit in both forecasts, it was used to examine the third hypothesis, that clinicians’ preferences for or against the guideline influence their use of the counting rule. This hypothesis was tested by introducing the three-level preference factor (favoring the guideline, favoring an alternative, or indifferent) as a second independent variable and examining the violations by preference interaction for each rating. Both interactions are significant: For the first rating, F = 17.199, df=14/19.104, p<.001; for the second rating, F=6.409, df=14/30.595, p<.001. Sub-group analyses illustrate the influence of preference on the counting rule, and specifically on the rejection threshold. Mean forecast ratings at each violation point for each preference are displayed in Figure 1. It shows a rejection threshold of 2 when the guideline recommendation is favored. When an alternative is favored, there is a sharp drop between 0 and 2 violations, as indicated by a significant pair-wise difference. However, the decrement between 1 and 2 is non-significant. Results are similar with no preference, except that the difference between 1 and 4 violations is non-significant, owing to a relatively large standard error (4.522 at 4 violations versus 4.128 at 2 and 4.069 at 3). In addition, the mean ratings at 3 violations were higher for the guideline-favored ratings than for non-favored or no-preference ratings. Overall, the results suggest that there is a rejection threshold of 2 when the guideline is favored. Otherwise, the rejection threshold is more generous and additional violations continue to exert an influence on the forecasts. Implications of these findings are discussed in the following section.
4 Discussion
Current discourse in clinical decision making is beset by a conflict between those who emphasize adherence to evidence-based practices (Chambers, Reference Chambers2008) and others who view clinical judgment as essential to making patient-specific treatment recommendations (Patel, Kaufman & Arocha, Reference Patel, Kaufman and Arocha2002). Questions about the value and importance of clinical judgment are routinely addressed in healthcare policy (Parks et al., Reference Parks, Radke, Parker, Foti, Eilers and Diamond2009; Rosenheck, Leslie, Busch, Rofman & Sernyak, Reference Rosenheck, Leslie, Busch, Rofman and Sernyak2008), in discussions about quality of care (Blumenthal, Reference Blumenthal1996; Zerhouni, Reference Zerhouni2003), in medical informatics (Fiol & Haug, Reference Fiol and Haug2009; Lipman, Reference Lipman2004), and in comparative effectiveness research (Basu, Reference Basu2009; Helfand, Reference Helfand2009). Among the proposals that have been advanced to diminish the conflict and minimize its deleterious influence on healthcare education, policy, and practice, the most fully developed is Eddy’s EBDM (Eddy, Reference Eddy2005). It requires both a guideline that makes evidence-based recommendations and clinical discretion in applying those recommendations to specific cases. Although EBDM was not developed expressly with image theory in mind, image theory’s two-phase conception of decision making complements EBDM’s conception of how evidence informs practice: At the first phase, clinicians decide whether to endorse the guideline’s recommendation; contingent on a general endorsement, they select a specific treatment.
The current study focused on the first decisional phase and examined the applicability of image theory’s simple counting rule to case-specific forecasting. It found that the counting rule describes how clinicians incorporate different kinds of knowledge. Hypotheses that clinicians weight population estimates more heavily than specific attributes were confirmed. However, the difference between the equally- and unequally-weighted counting rules was small (only 1.5%, gauged by the AIC index). The finding that all three attributes were important militates against an explanation frequently mentioned in conferences and anecdotal conversations: that clinicians adopt a “take the best” heuristic by focusing principally on a single cue (Marewski, Gaissmaier & Gigerenzer, Reference Marewski, Gaissmaier and Gigerenzer2010).
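The contrast between a single-cue strategy and the counting rule can be made concrete with a short sketch. It is a simplification under stated assumptions: the cue names, the cue ordering, the threshold of 2, and both function names are hypothetical, and the single-cue function is only a loose stand-in for “take the best” applied to one option, not the full procedure described by Marewski et al.

```python
from typing import Dict, Sequence

# Assumed cue ordering, with the general likelihood ranked first.
CUES: Sequence[str] = ("general_likelihood", "course", "adherence")


def single_cue_rejects(violations: Dict[str, bool], cue_order: Sequence[str] = CUES) -> bool:
    """Decide on the top-ranked cue alone and ignore every other cue."""
    return violations[cue_order[0]]


def counting_rule_rejects(violations: Dict[str, bool], threshold: int = 2) -> bool:
    """Equal-weight tally of violations across all cues; reject at the threshold."""
    return sum(violations[cue] for cue in CUES) >= threshold


# Favorable general likelihood, but unfavorable course and adherence.
case = {"general_likelihood": False, "course": True, "adherence": True}

single_cue_decision = single_cue_rejects(case)   # looks only at general_likelihood
counting_decision = counting_rule_rejects(case)  # tallies all three cues
```

In this case the single-cue strategy accepts and the counting rule rejects. If forecasts depended on the top cue alone, violations on course and adherence would leave them unchanged, which is not what the data showed.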
Asking subjects to make two forecasts of a positive response allowed us to examine how the treatment guideline in combination with clinician preference influences their use of the counting rule. Findings by Ditto and Lopez (Reference Ditto and Lopez1992) led us to expect that favored and non-favored preferences would have different rejection thresholds. The mean forecasts in Figure 1 confirm this expectation. What Figure 1 does not show is the relationship between rating and preference. This relationship is represented in Figure 2. It reports mean forecasts at each violation point, using a weighted count, for ratings that are consistent with preference. The broken line, which displays mean forecasts of the first rating when the guideline is favored, shows a sharp drop at 2 violations. The solid line, which displays mean forecasts of the second rating when an alternative is favored, shows a gentle slope from 0 to 2 and a rejection threshold of 3. These findings raise the possibility that guidelines, and specifically the expert use of guidelines consistent with EBDM, may have a de-biasing influence on clinical judgment (Almashat, Ayotte, Edelstein & Margrett, Reference Almashat, Ayotte, Edelstein and Margrett2008; Wolfson et al., Reference Wolfson, Doctor and Burns2000). The stability of this finding across guidelines, illnesses, and levels of expertise, and its implications for education and policy, bear further investigation.
4.1 Limitations
The results should be qualified by the study’s limitations, which pertain to the subject sample, stimulus, guideline, clinical paradigm, and experimental procedure. The data were drawn from a small sample of psychiatric trainees at a single and fairly select facility. It cannot be assumed that similar findings would have been obtained had the study been administered at a different facility or had the subjects been experienced clinicians. Nor can the results be generalized to trainees in other disciplines, such as nursing, psychology, or social work. Vignettes are used frequently in studies of clinical decision making and clinical training (Campo et al., Reference Campo, Williams, Williams, Segundo, Lydston and Weiss2008; Peabody, Luck, Glassman, Dresselhaus & Lee, Reference Peabody, Luck, Glassman, Dresselhaus and Lee2000). Nonetheless, their use remains controversial, particularly when results are compared with those of other procedures such as record reviews and standardized patients. A particular concern is whether vignette study data generalize to actual clinical practice (Fihn, Reference Fihn2000).
The Yale Psychiatry Sernyak guideline (YPSA) that was used in this study (Sernyak et al., Reference Sernyak, Dausey, Desai and Rosenheck2003) was the precursor to a “fail-first” policy (a requirement that two courses of first-generation treatment be tried before a second-generation treatment can be introduced) that was instituted briefly at the VA Connecticut Healthcare System (Rosenheck, Leslie & Doshi, Reference Rosenheck, Leslie and Doshi2008). The YPSA lacks the broad consensus that is enjoyed by other schizophrenia treatment guidelines, including the APA (Lehman, Lieberman, et al., Reference Lehman, Lieberman, Dixon, McGlashan, Miller and Perkins2004), the Schizophrenia PORT (Patient Outcomes Research Team: Lehman, Kreyenbuhl, et al., Reference Lehman, Kreyenbuhl, Buchanan, Dickerson, Dixon and Goldberg2004), and the TMAP (Texas Medication Algorithm Project: Moore et al., Reference Moore, Buchanan, Buckley, Chiles, Conley and Crismon2007). Two limitations of the YPSA were noted in previous sections: clozapine, which is the single most effective medication for treatment-resistant schizophrenia, is postponed until four other therapies have been tried, and there is no mention of injectable treatments, which are recommended for addressing adherence problems. In addition, the YPSA has no provision for so-called “adjunctive” or combination treatments that are commonly used in clinical and community practice. Switch recommendations are based solely on ratings of two items from an established assessment scale (Guy, Reference Guy1976). These ratings provide very limited information about the patient’s condition and treatment response, and their use as switching criteria has not been tested independently.
The procedure called on subjects to make two sequential forecasts for each vignette. This arrangement virtually invited them to use the guideline recommendation in the second rating rather than forecasting without a general likelihood estimate. As a result, it did not provide an appropriate test of image theory’s hypothesis that missing information is treated as a violation. However, with a repeated measures design there is no clearly superior alternative. For instance, had subjects been asked for first forecasts of all eight vignettes, then instructed to go through the vignettes again and make second forecasts, they could have drawn on memory or come to believe that the study was testing the consistency of their responses. Similar problems would occur if half of the ratings were made before the guideline was presented. As an alternative, subjects could be asked to rate only one or two vignettes, but given the small subject sample this procedure would have severely limited statistical power. Some of these limitations can be addressed by comparing ratings that rely on different guidelines, or by comparing guideline recommendations against specific alternatives rather than allowing subjects to pose their own.
Forecasts were made by drawing on only three clinical factors. Subjects in the guideline’s dissemination study (Sernyak, et al., Reference Sernyak, Dausey, Desai and Rosenheck2003) identified these factors as having the greatest influence on their recommendations. However, other clinical and experiential phenomena are important, including patient perceptions of illness, stressors, coping responses, metabolic and other medical complications, and medical and psychiatric co-morbidities. The question for clinical decision makers is, at what point does additional patient-specific information over-fit the data and unduly complicate the process of forecasting? This question can be investigated by studies that vary the type, amount, and quality of information that is included in case summaries.
Perhaps the most significant procedural limitation of the current study is that subjects were given information and asked only to write their forecasts on paper. In practice, relevant information is elicited through clinical examination and forecasts are discussed with the patient in the course of treatment planning. In this study, as in many others, the communicative aspects of decision making were eliminated in order to focus on the cognitive processes of treatment providers. But in clinical and community practice, EBDM is not the sole province of providers. Active involvement of patients is both inherent and desirable, especially in the treatment of severe mental illness, where decisions are made progressively over a protracted period and in an evolving system of care (Nielsen, Damkier, Lublin & Taylor, Reference Nielsen, Damkier, Lublin and Taylor2011; Pincus et al., Reference Pincus, Page, Druss, Appelbaum, Gottlieb and England2007).
4.2 Conclusion
Studies of medical decision making as a shared activity were occurring long before patient-centered care was formally incorporated into healthcare policy (Institute of Medicine, 2001). Early studies displayed strongly differing views about how practitioners should convey expert knowledge to patients, especially in forecasting likelihoods of disease, treatment response, and outcome (see Braddock, Fihn, Levinson, Jonsen & Pearlman, Reference Braddock, Fihn, Levinson, Jonsen and Pearlman1997; Greenfield, Kaplan & Ware Jr., Reference Greenfield, Kaplan and Ware1985; Strull, Lo & Charles, Reference Strull, Lo and Charles1984; Vertinsky, Thompson & Uyeno, Reference Vertinsky, Thompson and Uyeno1974). The issues have been clarified by recent work that has focused on concepts of numeracy, framing, and format (Gigerenzer & Gray, Reference Gigerenzer and Gray2011; Reyna, Nelson, Han & Dieckmann, Reference Reyna, Nelson, Han and Dieckmann2009; Timmermans, Ockhuysen-Vermey & Henneman, Reference Timmermans, Ockhuysen-Vermey and Henneman2008). But these studies continue to overlook that prediction is not the sole, or even the principal, purpose of quantitative forecasts in clinical practice. First and foremost, forecasts and the factors that influence them are subjects for discussion. For instance, if poor adherence is diminishing the prospect of a good treatment response, the crucial issues are why this person is not adhering to the regimen and how adherence can be improved. Persons with schizophrenia have their own criteria for gauging the effectiveness of treatment. Whether a progress score is 3 or 4 is less important than whether they can hold a job, keep an apartment, or have a relationship. Whether these aims of treatment are accomplishable, how, and over what period, are what patients want to know when they ask the question, “what are my chances?” An appropriate presentation of quantitative information makes a forecast more understandable. Incorporating this information into a treatment narrative is what makes it meaningful.
Prior to the paper that introduced the simple counting rule, Beach and associates drew a pivotal distinction between “aleatory” (calculated) and “epistemic” reasoning (Beach, Christensen-Szalanski & Barnes, Reference Beach, Christensen-Szalanski, Barnes, Wright and Ayton1987). This distinction became a cornerstone of image theory and of Beach’s later work in narrative behavioral decision theory (Beach, Reference Beach2010). The authors portrayed decision making as an epistemic task that “explicitly involves knowledge about the unique characteristics of specific elements and the framework of knowledge, including the causal network and set of members, in which they are embedded” (p. 147). The counting rule can be narrowly conceived as a smart heuristic, like Gigerenzer’s “tallying rule” (Marewski et al., Reference Marewski, Gaissmaier and Gigerenzer2010). More appropriately, its value lies in surmounting the polemic that dominates current discussions about evidence, decision making, and clinical practice. But it has a broader and richer place in the narrative tradition of medical decision making (for instance, Cronje & Fullan, Reference Cronje and Fullan2003; Epstein & Street, Reference Epstein and Street2011; Greenhalgh, Reference Greenhalgh1999; Kerstholt, van der Zwaard, Bart & Cremers, Reference Kerstholt, van der Zwaard, Bart and Cremers2009; Say, Murtagh & Thomson, Reference Say, Murtagh and Thomson2006), where the aim is neither to optimize nor to satisfice, but to spur an interactive and collaborative effort that determines “the best next thing for this patient at this time” (Weiner, Reference Weiner2004). By interpreting quantitative data in a meaningful and useful way, this process can reach an informed choice.