Several readers have sent this paper, parts of which raise genuine questions about the status of psychological findings on implicit bias and stereotype threat; I'd be interested in hearing from experts on these questions. (The authors clearly have an agenda of their own, and some of their points are rather tendentious, which is why it would be good to hear from other experts about whether their representation of the literature is fair.) In particular, I'm curious whether any of the claims below the fold are correct:
Beebee and Saul refer to a study by Amber E. Budden et al. that reported a 7.9 percent increase in female first authors in Behavioral Ecology in the four years that followed the adoption of double-blind review by the journal, which is supposed to suggest that there is gender bias favoring male authors.[28] However, questions have been raised about this conclusion drawn from the Budden et al. study. For example, it has been observed that, in the same period, other ecology journals also published more papers by women without switching to double-blind review, which suggests that the increase might have been due to an increase in submissions by female authors.[29] In response, Budden et al. wrote that their “study was observational and that the changes occurring at the journal where double-blind review was introduced might be due to alternate variables.”[30] Moreover, other studies by Budden et al. did not confirm the existence of gender bias.[31] Significantly, Nature almost immediately retracted its earlier report on the 2008 Budden et al. study in the following way:
After re-examining the analyses, Nature has concluded that ref. 1 [Budden et al.] can no longer be said to offer compelling evidence of a role for gender bias in single-blind peer review. In addition, upon closer examination of the papers listed in PubMed on gender bias and peer review, we cannot find other strong studies that support this claim. Thus, we no longer stand by the statement in the fourth paragraph of the Editorial, that double-blind peer review reduces bias against authors with female first names.[32] (Italics added.)
In their review of the literature, Stephen J. Ceci and Wendy M. Williams conclude that “[t]he preponderance of evidence, including the best and largest studies, indicates no discrimination in reviewing women’s manuscripts.”[33] In the light of these facts, Beebee and Saul should not continue citing the Budden et al. study as authoritative.
How about grant applications? Haslanger and Saul cite a study by Christine Wennerås and Agnes Wold that is supposed to have shown that “women needed to be 2.5 times as productive as men to get a grant [from the Swedish Medical Research Council].”[34] The Wennerås and Wold study made a big impact although it was based on a rather small sample of only 114 applications submitted for postdoctoral fellowships that were to be offered in 1995. Oddly enough, it is rarely mentioned that just six months after their article appeared, Nature published a study by Jonathan Grant et al. that relied on much more comprehensive evidence. The authors looked at 1,741 grant applications to the Wellcome Trust and 1,126 grant applications to the Medical Research Council (in the UK). They concluded that “this study has shown no evidence of discrimination against women.”[35]
More recently, Ulf Sandström and Martin Hällsten investigated 280 grant applications submitted to the Swedish Medical Research Council in 2004.[36] Their conclusion is that “female principal investigators receive a 10% bonus on scores.”[37] More generally, Ceci and Williams report that “the weight of the evidence overwhelmingly points to a gender-fair review process” in grant funding.[38] Their conclusion is based on a number of smaller studies from different countries (including the abovementioned study by Grant et al.) as well as on six large-scale studies, including one by Herbert W. Marsh et al. that “found no significant gender differences in peer reviews of grant applications.”[39] A similar conclusion was reached more recently by two primary studies: one by Marsh et al., which focuses on Australian grant applications, and one by Ruediger Mutz et al., which focuses on Austrian grant applications.[40] Together these recent studies involved more than 30,000 reviews of grant applications, and neither found clear evidence of sex discrimination....
Beebee, Haslanger, Carole Lee, Christian Schunn, Jesse Prinz, and Saul cite a study by Rhea E. Steinpreis et al. that involved sending CVs to two groups of psychologists: one group received the CV with a male name, the other group received the same CV with a female name.[42] The psychologists tended to judge the CV more favorably when they had received it with a male name—although the difference disappeared when participants took the candidate to be a tenure applicant rather than a job applicant. Corinne A. Moss-Racusin et al. obtained a similar result in a study involving participants from the natural sciences (physics, chemistry, and biology).[43]
These results should strike one as surprising: Why does there appear to be gender bias in the evaluation of CVs, but not in the evaluation of grant applications? Whatever the explanation, the difference in sample size between the two kinds of research is a reason to place more weight on the latter: the grant studies were conducted on a much larger scale, involving thousands (e.g., 6,000–18,000 in the cited primary studies) of reviewers instead of the few hundred (127–238) that were recruited for the CV studies....
this is not the result one would expect if there were a tendency to favor male CVs. To be sure, the result is not inconsistent with the CV studies, because there is usually an explanation for why real-life data differ from experimental findings. For example, the experiments on which the CV studies are based have an artificial feature that may have affected the responses: participants were asked to evaluate a single (male or female) candidate, whereas real-life hiring normally involves comparisons between (male and female) candidates.
Implicit Association Tests
An explanation of the underrepresentation of women that many philosophers (Antony, Beebee, Haslanger, Margaret Crouch, Saul) find compelling is “implicit bias”: unconscious attitudes and beliefs that affect our explicit judgments about women.[45] As evidence of this, Antony, Beebee, Haslanger, Prinz, and Saul cite the study by Steinpreis et al. about the effect of having a female rather than male name appear on a CV.[46] Beebee and Saul cite, in addition, the study by Budden et al. about the effect of having a female rather than male name appear on a paper submitted to a journal.[47] Our previous section, however, has already cast doubt on these two studies. Moreover, neither study controls for attitudes or beliefs about women that are consciously held, so they are at best inconclusive evidence for unconscious bias against women....
In particular, [philosophers have] cite[d] a psychological test that is known as the Implicit Association Test, or IAT.[49] The test is designed to measure the strength of one’s unconscious associations by comparing one’s reaction times in certain classification tasks. Typically, the test involves two related tasks. For example, one is first asked to place items (e.g., “biology”) into the categories “women or science” or “man or literature”; subsequently, one is asked to place items into the categories “women or literature” or “man or science” (of course, the order can be reversed). If one is faster in placing the correct items into the category “women or science” (first task) than into the category “women or literature” (second task), then the association between the first two concepts will count as stronger; and likewise for “man or science” and “man or literature.”
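The comparison described above can be sketched in a few lines of Python. This is a simplified illustration only: the latencies are invented, the function name `mean_latency_gap` is hypothetical, and published IAT scoring procedures involve further steps (such as error handling and standardization) that are not shown here.

```python
from statistics import mean

def mean_latency_gap(pairing_a_ms, pairing_b_ms):
    """Return mean latency of pairing B minus mean latency of pairing A, in ms.

    A positive value means responses were faster under pairing A, which the
    logic of the test reads as a stronger association for pairing A.
    """
    return mean(pairing_b_ms) - mean(pairing_a_ms)

# Invented reaction times (milliseconds) for one hypothetical participant.
women_or_science_ms = [700, 750, 800]      # first task
women_or_literature_ms = [600, 650, 700]   # second task

gap = mean_latency_gap(women_or_science_ms, women_or_literature_ms)

# A negative gap means this participant sorted items faster in the second task.
print(gap)
```

Note that the sketch compares raw averages for a single participant; it says nothing about whether such a difference is stable across sessions or predictive of behavior, which are the concerns raised in what follows.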
Unfortunately, building one’s case for unconscious bias against women on the IAT is risky, because the test has been the subject of serious and ongoing controversy. Roughly, concerns about the IAT are raised on three counts:
- measurement assumptions (for example, the way in which differences in test scores are supposed to correspond to differences in association strength between concepts such as “women” and “science”)
- possible confounders (perhaps the difference in reaction time can be explained by factors that do not imply bias or prejudice)
- predictive value (i.e., whether the IAT actually predicts discriminatory behavior)[50]
It is impossible to go into all the arguments for these concerns, but the relatively low test-retest reliability of the IAT should certainly make one wary of its evidential status. For example, according to one of the most prominent advocates of the test, Anthony Greenwald, the average test-retest reliability of the IAT is 0.56.[51] Moreover, William A. Cunningham et al., who also support the IAT, report a test-retest reliability over a two-week period that is as low as 0.27 (although they attribute this partly to measurement error)....[52]
Returning to the literature on the IAT, it’s notable that the 2009 article by John T. Jost et al.—cited by Saul as a rejoinder to IAT critics—was not the last word written on the subject.[59] In the same year, Hart Blanton et al.’s meta-analysis concluded that the IAT does “not permit prediction of individual-level behaviors.”[60] To be sure, their meta-analysis did not cover all studies. One study that they wanted to include was by Laurie Rudman and Peter Glick, which Jost et al. claimed “no manager should ignore.”[61] But the data for Rudman and Glick’s study had been “lost.”[62] In 2013, a larger meta-analysis appeared that concluded that IATs are “poor predictors” of discriminatory behavior....[63]
Stereotype Threat
The phrase “stereotype threat” refers to a situation in which subjects tend to underperform on a given task because they are afraid of confirming a negative public stereotype about their group. A number of philosophers think that this phenomenon plays an important role in the underrepresentation of women in philosophy....
Although no research about stereotype threat in philosophy has been done, the argument is that since the phenomenon has been empirically confirmed in other academic disciplines it is safe to assume that the same effects must be present in philosophy as well. But the basis for this extrapolation is more dubious than many believers in stereotype threat think. In reality, there is a lot of skepticism among psychologists about the significance of stereotype threat, and even about whether it exists.
A number of studies were unable to replicate the stereotype threat effect, particularly in so-called high-stake situations, which are most relevant for potentially explaining real-life disparities between men and women in academia.[73] One of the scholars who got negative results for stereotype threat is John A. List, professor of economics at the University of Chicago, who comments:
"So we designed the experiment to test that, and we found that we could not even induce stereotype threat. We did everything we could to try to get it. We announced to them, “Women do not perform as well as men on this test and we want you now to put your gender on the top of the test.” And other social scientists would say, that’s crazy — if you do that, you will get stereotype threat every time. But we still didn’t get it."[74]
So, what is going on here? Why do different empirical studies get contradictory results? List offers this explanation:
"I think that stereotype threat has a lot of important boundaries that severely limit its generalizability. I think what has happened is, a few people found this result early on and now there’s publication bias. But when you talk behind the scenes to people in the profession, they have a hard time finding it. So what do they do in that case? A lot of people just shelve that experiment; they say it must be wrong because there are 10 papers in the literature that find it. Well, if there have been 200 studies that try to find it, 10 should find it, right?"[75]
And indeed, the problem of publication bias (also known as “the file drawer problem”) is acutely present in this area of research. For example, in a recent study on the possible stereotype threat among young females, Colleen M. Ganley et al. warn that there is “serious concern” that the alleged effect might be an illusion based on publication bias. Ganley et al. point out that while published articles had a tendency to confirm the existence of stereotype threat, none of the three unpublished dissertations showed that effect.[76] They also complain that they were unable to perform a meta-analysis because the number of available empirical investigations was too small and because many of the studies did not provide information necessary for calculating effect sizes.[77]
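List's back-of-the-envelope arithmetic can be made explicit. The sketch below assumes, purely for illustration, that there is no true effect, that the studies are independent, and that each uses the conventional .05 significance threshold. The figure of 200 studies is List's hypothetical, not a count taken from the literature, and the function names are made up for this example.

```python
def expected_false_positives(n_studies, alpha=0.05):
    """Expected number of significant results when no true effect exists."""
    return n_studies * alpha

def prob_at_least_one(n_studies, alpha=0.05):
    """Chance that at least one of n independent null studies is significant."""
    return 1 - (1 - alpha) ** n_studies

# List's hypothetical: 200 attempts to find the effect.
expected = expected_false_positives(200)
print(expected)

# Even a modest number of null studies is likely to yield one "finding".
print(prob_at_least_one(20))
```

Under these assumptions, roughly ten significant results out of 200 attempts is what chance alone would produce, which is why a literature that records only the significant results can mislead.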
Another article giving an overview of the literature on gender-based stereotype threat in mathematics raises further methodological concerns and recommends caution. Gijsbert Stoet and David C. Geary undertook to analyze and evaluate all attempts to date to replicate the results of the first and most widely cited study, by Steven J. Spencer, Claude M. Steele, and Diane M. Quinn, of the alleged stereotype threat affecting women’s math performance.[78] Stoet and Geary warn about several serious shortcomings endemic in this research: an incomplete description of results (no reports about means or standard deviations), significance values being relaxed when the data matched the hypothesis, a biased presentation whereby the significant measure was highlighted in the text and abstract while the nonsignificant one was relegated to a footnote, etc. Stoet and Geary single out for criticism those scholars who draw conclusions about stereotype threat in women from experiments that did not have a control group (men). They point out, correctly, that this kind of inference is as logically problematic as “if one would conclude that a study with people who all wear clothes says something unique about people wearing clothes.”....[79]
To return to Stoet and Geary’s overview of the literature, their conclusion is not good news for proponents of the stereotype threat hypothesis. Stoet and Geary found only twenty studies that (a) addressed gender-related stereotype threat in adult math performance, and (b) had a similar research design to the seminal study on stereotype threat by Spencer, Steele, and Quinn. Only eleven of those twenty studies (55 percent) replicated the effect of stereotype threat (at the conventional .05 significance level). After also excluding those studies that raise methodological worries because they selected as subjects men and women who were known to have equal, previously measured math scores,[82] the number of studies was narrowed from twenty to ten. Of the remaining ten, only three studies replicated the original results.
Given this outcome, it is not surprising that Stoet and Geary conclude that the stereotype threat hypothesis has not been confirmed, even if the clear danger of publication bias is disregarded: “Even when assuming that all failures to replicate have been reported, we can only conclude that evidence for the stereotype threat explanation of the gender difference in mathematics performance is weak at best.”[83]
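The replication rates follow directly from the counts quoted above. A minimal sketch, using only the figures attributed to Stoet and Geary (the helper name `replication_rate` is hypothetical):

```python
def replication_rate(replicated, total):
    """Share of studies that replicated the original effect, as a percentage."""
    return 100 * replicated / total

all_designs = replication_rate(11, 20)     # all twenty comparable studies
after_exclusion = replication_rate(3, 10)  # the ten left after exclusions

print(f"{all_designs:.0f}% and {after_exclusion:.0f}%")
```

The drop from 55 percent to 30 percent is the quantitative basis for the "weak at best" verdict, before any allowance is made for unpublished failures to replicate.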