Comment on The Irreproducibility Crisis: John Staddon

May 02, 2018 |  John Staddon


The State of Science

This is a much-needed report. I applaud the thoroughness of David Randall and Christopher Welser. They have sought out essentially every relevant reference on this vital topic. They have put their finger on the major problems with statistically based - null-hypothesis significance testing (NHST) - research. Occasional criticism of the report as “politically motivated” because it does not share the critics’ uncritical acceptance of climate science and AGW is completely unwarranted. Climate science has many respectable, honest critics.

I would just like to add a few comments on non-NHST methods plus a couple of remarks on topics not emphasized by Randall and Welser: correlation vs. causation, individual differences and, finally, the nature of science and what might be done to fix things. Many of these issues are discussed at greater length in my book: Scientific Method: How science works, fails to work and pretends to work.

Non-statistical science

Not all experimental science requires statistics.  Much of experimental psychology involves phenomena that are reversible, so can be immediately replicated with no need for statistics.  Most of the phenomena of sensation and perception - judging brightness or loudness, most visual illusions - as well as reflex properties, short-term memory, even IQ measures, can be easily repeated.  There is no replication crisis in these areas.

Another area is learning and conditioning, especially in animals.  Much research on operant conditioning, for example, involves exposing subjects to a reward schedule such as fixed interval (FI): the hungry pigeon or rat gets a bit of food for the first peck or lever press 60 seconds (say) after the last bit of food.  No matter what the animal’s past history, after a few dozen rewards delivered in this way, essentially every subject shows the same pattern: wait for a time (proportional to the interfood interval) after each reward, then respond until the next reward.  This pattern will recur every time the animal is exposed to the procedure, no matter what its experience in between exposures.  Many other reward schedules produce repeatable behavior patterns like this. No statistics are required. Experimental results are replicable, although no one in this area has felt the need to think of the field in this way - any more than Faraday worried about the replicability of his easily repeated experiments with coils and magnets.

Not all learning experiments are like this, and for good reason.  Although the animal in an operant conditioning experiment will show the same behavior pattern on the same schedule, he will relearn it faster on the second exposure than the first.  In other words, his behavior may look the same on these two occasions, but his state is not.  Many learning effects are not reversible.  Hence many, perhaps even the majority, of animal-learning experiments are done with groups.  See, for example, this typical paper and many others in this journal and several others.  These studies are subject to all the strictures that apply to the NHST method in general.  

Although I have not done a careful count, it is my impression that in this general area there are many more journal papers of the second sort, using groups of subjects and analyzed statistically, than papers of the first sort, using individual animals. This has always puzzled me because the effort involved in group studies, which require many animals, seems so much greater than the effort required to study a handful of individuals. The reason, I think, is twofold: the group/statistical method is algorithmic - you can just follow ‘gold-standard’ rules - and, because of the too-generous 5% ‘significance level’, about one in twenty experiments is pretty much guaranteed to give a positive result. The group method is like what operant conditioners call a variable-ratio schedule: results are strictly proportional to effort. The NHST method usually can be relied upon to produce more positive results (questionable as many of them now turn out to be) than the single-subject method. It is for this reason much more congenial to the perverse incentives under which most scientists must work (see the NAS report and this link). Perhaps this is the reason for the continued popularity of this approach to animal learning.
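The one-in-twenty arithmetic can be made concrete with a minimal sketch. It assumes that every experiment tests an effect that is truly absent and that the experiments are independent; the function names and the choice of 20 experiments are illustrative assumptions, not taken from the report.

```python
def prob_at_least_one_false_positive(n_experiments, alpha=0.05):
    """Chance that at least one of n independent experiments on a
    nonexistent effect reaches 'significance' at level alpha."""
    return 1 - (1 - alpha) ** n_experiments


def expected_false_positives(n_experiments, alpha=0.05):
    """Average number of spurious positive results; it grows in
    direct proportion to the number of experiments run."""
    return n_experiments * alpha


# Compare the conventional 5% level with a stricter 1% level.
for alpha in (0.05, 0.01):
    print(alpha,
          round(prob_at_least_one_false_positive(20, alpha), 3),
          expected_false_positives(20, alpha))
```

Under these assumptions, twenty experiments on nonexistent effects yield at least one ‘significant’ result about 64% of the time at the 5% level, and about 18% of the time at the 1% level.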

Correlation vs. causation

Causation is much discussed, especially in disciplines such as economics, where experiment, the most direct and obvious test for causation, is usually impossible.  It is a truism that causation cannot be reliably inferred from correlation. The reason is that A may be correlated with B not because A causes B (or vice versa), but because both are caused by (unknown variable) C.   Statistician R. A. Fisher famously commented: “He [statistician Udny Yule] said that in the years in which a large number of apples were imported into Great Britain, there were also a large number of divorces. The correlation was large, statistically significant at a high level of significance, unmistakable.”  We don’t know what third cause, C, may lie behind this odd correlation. But we can be pretty sure that the apples did not cause the divorces, or vice versa.  But when A and B are linked in the popular mind, like fat and ill-health or cancer and smoking, the medical establishment happily lets correlation stand for cause.    

The only sure way to prove causation is by experiment.  No amount of correlation, of apples with divorces or even smoking and its various supposed sequelae, is as good as actually presenting the cause and getting the effect.  And doing this repeatedly. 

Problems arise when experiment is difficult or even impossible.  There are often either ethical or practical problems.  Does passive smoke cause cancer? We cannot present people with something that may give them cancer; and we can’t wait fifty years (70 is the average age at which lung cancer is diagnosed) before reaching a conclusion. In such cases, correlation is all we’ve got. 

When experiment is impossible, scientists, health-policy officials and science journalists should above all be cautious. 

They are not.

Google “secondhand smoke and cancer”, for example, and this is what you get:

Secondhand Smoke and Cancer - National Cancer Institute…Approximately 3,000 lung cancer deaths occur each year among adult nonsmokers in the United States as a result of exposure to secondhand smoke (2). The U.S. Surgeon General estimates that living with a smoker increases a nonsmoker's chances of developing lung cancer by 20 to 30 percent (4).

No reservations, no qualifications: there is “no safe level of smoking”. This is typical of most health warnings: crying not “wolf” but a whole pack of wolves. 

There is in fact pretty good correlational data about the effects of secondhand smoke.  The problem is: it is rarely cited. The most extensive recent study with any credibility looked at 35,000 never-smokers in California with spouses with known smoking habits.  The reference appears in the citations for Chapter 7 of the 2006 Surgeon General’s 709-page report - but is not discussed in the text.  The participants were selected from 118,000 adults enrolled in late 1959 in a cancer-prevention study.  The researchers asked a simple question: are Californians who are married to smokers likely to die sooner than those married to non-smokers?   Their answer is unequivocally equivocal:

No significant associations were found for current or former exposure to environmental tobacco smoke before or after adjusting for seven confounders and before or after excluding participants with pre-existing disease. No significant associations were found during the shorter follow up periods of 1960-5, 1966-72, 1973-85, and 1973-98…The results do not support a causal relation between environmental tobacco smoke and tobacco related mortality, although they do not rule out a small effect. The association between exposure to environmental tobacco smoke and coronary heart disease and lung cancer may be considerably weaker than generally believed. 

Do these data imply “3,000 lung cancer deaths each year” and an increase in the chance of “developing lung cancer by 20 to 30 percent”?  Of course not.  The official line is little more than fabrication.

Even when the correlation is weak, the official recommendation is strong. The asymmetrical payoff matrix for health-and-safety officials - big cost if you miss a real risk like the thalidomide horror, vs. small benefit if you say nothing - seems to have produced an H&S culture that is both risk-averse and vocal. Almost anything that is correlated with ill-health, no matter how weakly, is treated as a cause. Loud warnings, taxes and prohibitions follow.

Individual differences

The father of experimental medicine, Claude Bernard, once wrote “Science does not permit exceptions.” But statistics, the NHST method, exists because of exceptions. If an experimental treatment gave the same, or at least a similar, result in every case, statistics would not be necessary. But going from the group, which is the subject matter for statistics, to the individual, which is the usual objective of psychology and biomedicine, poses problems that are frequently ignored.

A couple of examples may help. In the first case, the subject of the research really is the group; in the second, the real subject is individual human beings, but the group method is used. 

Polling uses a sample to get an estimate of the preferences of the whole population.  Let’s say that our sample shows 30% favoring option A and 70% favoring B.  How reliable are these numbers as estimates of the population as a whole?  If the population is small, the answer depends on the size of the sample in relation to the size of the population.  If the population is little larger than the sample, the sample is likely to give an accurate measure of the state of the population. 

If the population is large, this method is not possible. For this, a model is needed. One possibility is to compare sampling to coin tossing. Each binary subject choice is like the toss of a biased coin. In the case of the example, the estimated bias is heads: 0.3, tails: 0.7. If our sample is 100, we can then ask: if the bias really is 30/70, what would be the chance of getting samples with biases differing from 30/70 by (say) 5%? Such an estimate of the margin of error depends only on the sample size (number of coin tosses). Whether the population is large or small, the aim is to draw conclusions not about the individual decision makers, but about the population as a whole. The method does not violate Bernard’s maxim. Since the conclusion is about the group, there are no exceptions.
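The coin-tossing model can be sketched in a few lines. This is a minimal illustration that assumes simple random sampling; the function names, the population size of 110, and the finite-population correction (a standard textbook adjustment that is not part of the argument above) are illustrative assumptions.

```python
import math


def standard_error(p, n, population=None):
    """Standard error of a sample proportion p based on n respondents.

    If a finite population size is supplied, apply the finite-population
    correction, which shrinks the error as n approaches the population size.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return se


def prob_within(p, n, tolerance):
    """Exact binomial probability that the sample proportion falls within
    `tolerance` of p, treating each response as a toss of a biased coin."""
    lo = math.ceil((p - tolerance) * n - 1e-9)   # small epsilon guards against
    hi = math.floor((p + tolerance) * n + 1e-9)  # floating-point rounding
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(lo, hi + 1))


# Large population: the error depends only on the sample size.
print(round(standard_error(0.3, 100), 4))
# Population barely larger than the sample: the error is much smaller.
print(round(standard_error(0.3, 100, population=110), 4))
# Chance that a sample of 100 lands within 5 points of a true 30/70 split.
print(round(prob_within(0.3, 100, 0.05), 2))
```

Quadrupling the sample size halves the standard error in the large-population case, whatever the size of the population itself.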

Experiments in social science are aimed at understanding not the group or the population as a whole, but the individual. For example, in 1979 Daniel Kahneman and Amos Tversky came up with a clever way to study human choice behavior.  This work eventually led to many thousands of citations and an economics Nobel prize in 2002.  K & T consulted their own intuitions and came up with simple problems which they then asked individual subjects to solve.  The results, statistically significant according to the standards of the time, generally confirmed their intuition. They replicated many results in different countries.

In one case from a classic paper, subjects were asked to pick one of two choices: A: 4000 (Israeli currency) with probability 0.2, or B: 3000 with probability 0.25. 65% of the subjects picked A, with an expected gain of 800; the remaining 35% picked B, with an expected gain of 750. This represents rational choice on the part of 65% of choosers.

K & T contrast this result with an apparently similar problem, A: 4000, p = 0.8, vs. B: 3000, p = 1.0. In this case, 80% of subjects chose B, the certain option with the lower expected value, an irrational choice. K & T then use this contrast, and the results of many similar problems, to come up with an alternative to standard utility theory. This new theory does not apply to either minority in these two problems, neither the 35% who chose irrationally in the first case nor the 20% who chose rationally in the second. Prospect theory, as K & T called it, is a list of effects - labeled certainty effect (this case), reflection, isolation, quantum and subsequently several others.
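The expected values behind these two problems are simple arithmetic, sketched below. The amounts, probabilities and percentages are those quoted above; the function and variable names are illustrative, and ‘rational’ is taken here to mean only ‘maximizes expected value’.

```python
def expected_value(amount, probability):
    """Expected gain of a gamble paying `amount` with the given probability."""
    return amount * probability


# Each option is (amount, probability); shares are the fractions of
# subjects reported as choosing option A in each problem.
problems = {
    "problem 1": {"A": (4000, 0.2), "B": (3000, 0.25), "share_choosing_A": 0.65},
    "problem 2": {"A": (4000, 0.8), "B": (3000, 1.0), "share_choosing_A": 0.20},
}


def summarize(problem):
    """Return both expected values, the option with the higher expected
    value, and the share of subjects who actually chose that option."""
    ev_a = expected_value(*problem["A"])
    ev_b = expected_value(*problem["B"])
    best = "A" if ev_a > ev_b else "B"
    share_a = problem["share_choosing_A"]
    share_best = share_a if best == "A" else 1 - share_a
    return ev_a, ev_b, best, share_best


for name, problem in problems.items():
    print(name, summarize(problem))
```

In both problems option A has the higher expected value, yet the share of subjects choosing it falls from 65% to 20%. The group majority shifts, while a substantial minority in each problem behaves the opposite way.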

Each effect is presented as a property of human choice behavior, the necessary and sufficient conditions for which are the appropriate choice question or questions. Because the group effect is reliable, individual exceptions are ignored. There is little doubt, for example, that giving subjects a lesson or two in probability, or phrasing the question slightly differently (e.g., not “What is your choice?” but “What would a statistician choose?”), would greatly change the results. Nevertheless, the theory has gained some acceptance as an accurate picture of individual human choice, even though it is valid only for a subsection of the population under one set of conditions. In short, prospect theory is not a theory about individual human beings but about the behavior of groups - groups large enough to allow positive results from standard tests of statistical significance. Kahneman and Tversky were not alone in treating their group theory as a theory of individual choice behavior. The field of NHST made and continues to make the same mistake: a significant group result is treated as a property of people in general.

The aim of these experiments, unlike the polling examples I gave earlier, is to understand not groups but individuals. Even if a study is perfectly reproducible, to go beyond a group average, the experimenters would need to look at the causal factors that differentiate individuals who respond differently.  What is it, about the constitution or personal histories of individuals, that makes them respond differently to the same question? Solving this problem, satisfying Claude Bernard’s admirable axiom, is obviously much tougher than simply asking “do you prefer 3000 for sure or a 0.8 chance of 4000?” But until this problem is solved, both prospect theory and expected-utility theory - not to mention numberless other psychological theories - give a distorted picture of individual human behavior.

What is to be done?

I agree with the problems identified in the report and with many of the remedies proposed. But I do have reservations on some points.  I very much agree that the flawed NHST method would be improved by changing from Fisher’s 5%, devised for a very different kind of problem than basic research, to a more rigorous 1%. There are also more fundamental problems with using the NHST method to find out about individuals, as I have just pointed out. Openness, free availability of data, is obviously essential. It is totally unacceptable for any scientist to hide the data on which his conclusions are based. This is surely an absolute. But attempting to regulate scientific practice in great detail is probably a mistake.

Experimental science is a Darwinian process: variation and then selection. The variation is the generation of testable hypotheses through observation and study of previous experimental work. Hypotheses, however generated, are then tested experimentally - selection - and the process repeats. The generation of hypotheses is a creative process; it is not subject to any but the most general rules. So, I would hesitate to recommend more control of experimental methods. The Darwinian analogy also suggests that negative results will be much more frequent than positive results. It is likely, therefore, that publishing all negative results, as the report suggests, is not feasible. On the other hand, publishing disconfirmations of generally accepted results is not only feasible but essential.

Several thoughtful critics, like the authors of this report, have argued persuasively for pre-registration of hypotheses as a partial solution to the problem of treating ‘found’ results as confirmed hypotheses. I see a serious problem with this level of regulation. Pre-registration presupposes that scientists must be coerced to do their research honestly and not to treat an accidentally significant result as if they began with it as a hypothesis. A researcher who is both competent and honestly desirous of finding the truth simply would not behave in that way. Forcing a dishonest or badly motivated researcher to pre-register will just test his ingenuity and challenge him to come up with some other shortcut to quick publication. There is no substitute for honesty in science. In the absence of honesty, imposing rules like pre-registration will produce rigid and uncreative work that is unlikely to advance knowledge and may well retard it.

The fundamental problem, which is not really addressed by the report, is motive.  If a scientist really wants to get to the truth, he (or she) will attend to all the strictures about reproducibility that are highlighted in this invaluable report.  Unfortunately, it is all too clear that Professor Wansink, the report’s poster-boy for publish-at-any-cost, is representative of very many social and biomedical scientists. Too often the aim is not knowledge, but publication. The problem of motive cannot be solved by administrative fiat. Perhaps the incentives can be made less malign. Perhaps the problem cannot be solved at all in a system where doing science is a career not a vocation.  Perhaps we simply have too many scientists who are unsuited to their task?


John Staddon is James B. Duke Professor of Psychology and Professor of Biology, Emeritus, at Duke University. His most recent book is Scientific Method: How science works, fails to work and pretends to work (Routledge, 2017).


