A few weeks ago, Radiolab released a podcast short, ‘For the Love of Numbers’, on the peculiar propensity of people to become attached to specific numbers. They discussed the old Pythagorean idea that numbers are gendered, specifically that odd numbers are male and that even numbers are female, and the ways in which that thinking persists today. They cited a study by James Wilkie and Galen Bodenhausen, ‘Are Numbers Gendered?’ (Journal of Experimental Psychology, 2011).
In that study, participants were shown pictures of very young infants, dressed in white. For half the participants, the pictures were labeled with either three odd digits or three even digits. Participants were told that these numbers were randomly assigned ID numbers, and to disregard them.
The study, and Radiolab in turn, reported that evenly-labeled babies were significantly more likely to be judged female, and oddly-labeled babies male, independent of the baby’s actual sex. What the study did report, but Radiolab did not mention, is that the effect size was minuscule.
Participants were asked to rate how likely they thought it was that the pictured baby was male, with a score of 1 meaning “not at all likely” and a score of 7 meaning “extremely likely”.
Study participants gave oddly-numbered babies a mean score of 3.79, with a standard deviation of 0.53. Evenly-numbered babies were given a mean score of 3.47, with a standard deviation of 0.59. Even considering the modest number of participants (36), this is a very small effect size.
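To put that gap in perspective, here is a quick back-of-the-envelope calculation using the means quoted above (a sketch of my own; the framing of the difference as a share of the 7-point scale is my illustration, not a figure from the paper):

```python
# Back-of-the-envelope look at the reported mean ratings on the 1-7 scale.
mean_odd = 3.79   # mean "likely male" rating for oddly-numbered babies
mean_even = 3.47  # mean rating for evenly-numbered babies

diff = mean_odd - mean_even        # raw difference in mean rating
scale_range = 7 - 1                # the response scale spans 6 points
frac_of_scale = diff / scale_range # difference as a share of the scale

print(f"difference: {diff:.2f} points ({frac_of_scale:.1%} of the scale)")
```

On this reading, the two groups’ average ratings differ by about a third of a point, roughly five percent of the available scale, with both means sitting near the “no idea either way” midpoint of 4.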
In fact, Abumrad and Krulwich (the hosts of Radiolab) interviewed Wilkie, and asked him about the size of his effect.
Radiolab: “By how much more likely? By a lot, or by a teenie bit?”
Wilkie: “Uh, by a statistically significant amount. Not a landslide.”
“Not a landslide.” That is an understatement.
It is difficult to discuss statistical significance because, while it is the gold standard of scientific research, few non-practitioners understand how manipulable data really is. Significance alone is not enough: you also need to know which tests were run, the size of the sample (the “n”, or number of independent subjects), and the standard deviation. In the Wilkie study, the standard deviation is larger than the difference between the group means; that is, if you move one standard deviation out from each mean, the two groups’ ranges overlap almost entirely. That is a sure sign that something is amiss, significant or not.
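That overlap is easy to see by writing out the mean plus-or-minus one standard deviation interval for each group, using the figures quoted earlier (a minimal sketch):

```python
# Intervals of mean +/- one standard deviation for each label group,
# using the means and standard deviations reported in the study.
odd_lo, odd_hi = 3.79 - 0.53, 3.79 + 0.53    # oddly-numbered babies
even_lo, even_hi = 3.47 - 0.59, 3.47 + 0.59  # evenly-numbered babies

# Two intervals overlap when the later start precedes the earlier end.
overlap = max(odd_lo, even_lo) < min(odd_hi, even_hi)

print(f"odd:  {odd_lo:.2f} to {odd_hi:.2f}")
print(f"even: {even_lo:.2f} to {even_hi:.2f}")
print("intervals overlap:", overlap)
```

The odd interval runs from 3.26 to 4.32 and the even interval from 2.88 to 4.06, so the band from 3.26 to 4.06 is shared by both groups: a typical rating for an “odd” baby is indistinguishable from a typical rating for an “even” one.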
I like Radiolab a great deal. I even liked that podcast short. But it is a show that purports to be about science, and its creators should think critically about the research they present. Even a casual consumer of science should know that standard deviations should not completely swallow an effect, and that an author’s claim of a “significant” p-value does not excuse us from healthy scepticism.