For the past decade, psychology has been in the midst of a replication crisis. Large, high-profile studies have found that only about half of the findings in the behavioral science literature can be replicated—a discovery that has cast a long shadow over psychological science, but that has also spurred advocates to push for more rigorous research methods.
Now, one of the first systematic tests of these practices in psychology suggests they do indeed boost replication rates. When researchers “preregistered” their studies—committing in advance to a written plan for the experiment and data analysis—other labs were able to replicate 86% of the results, they report today in Nature Human Behaviour. That’s much higher than the 30% to 70% replication rates found in other large-scale studies.
“They’re showing that by adopting these more stringent experimental protocols, other labs are able to replicate the work, which I think is very important,” says David Peterson, a sociologist of science at Purdue University who was not involved with the work. But he and others warn that the replicated studies may have been different in other ways, too, so the results may not generalize to other research.
More than a decade ago, in the early days of what has come to be known as the replication crisis, four behavioral scientists from different subdisciplines got to talking about the causes of the problem. Some psychologists argued that low replicability in psychology was the result of questionable research practices, such as asking multiple questions of the data and publishing only the analyses with the best results. But other researchers argued that human behavior is inherently variable, so behavioral science may always struggle to reproduce results. To disentangle these possibilities, the scientists decided to see what would happen if they all got their labs to conduct rigorous experiments, and then tried to replicate one another’s work.
Each of the four labs submitted four studies for replication. These covered a variety of behavioral science topics, including trust, self-control, the effects of advertising, and how people behave in groups.
The experiments were conducted using each lab’s typical practices, with one important condition: The researchers behind them preregistered their work. That included writing down their hypotheses, procedures, and data analysis plans and sharing the plans with the project coordinator before they launched their experiments. Advocates of preregistration say this commitment to a predetermined plan makes it harder for researchers to cherry-pick interesting findings and bury negative results.
Each of the four labs then tried to replicate all 16 studies, using large samples of more than 1500 participants. The original researchers communicated with one another only via the project coordinator to keep the replications as independent as possible, reflecting the usual conditions when researchers attempt to replicate another group’s work.
Across all 64 replications, 55 found the same effect as the original study—an 86% replication rate. In previous replication studies, scientists attempting to replicate other labs’ work tended to find smaller effects than the original studies reported. In the new study, however, the replications produced effects of similar size to those found by the original labs.
It’s the first replication attempt that has followed studies from their conception through to independent replication, says Brian Nosek, executive director of the Center for Open Science and one of the four lab directors. Rather than choose a sample of studies from the literature retrospectively, he and his colleagues wanted to track whether they could more easily replicate work that had tried to improve rigor right from the start: “And we succeeded!”
However, the fact that the researchers chose which of their studies to put forward for replication gives some pause to Berna Devezer, a metascientist at the University of Idaho (UIdaho) who was not involved in the work. The 16 findings used in this study were chosen very differently from past replication studies—they “are not randomly selected from, or representative of, a well-defined literature,” she says.
The authors also checked whether their results looked different when they used other ways to define a successful replication. Some of those methods put the replication rate as low as 71%. But highlighting higher estimates and burying lower rates in the details of the paper is “disturbing and ironic,” says UIdaho metascientist Erkan Buzbas, “because some of the authors are ardent proponents of not cherry-picking results.”
Low replicability may continue to be a problem in behavioral science, where complex subjects such as human or animal behavior mean huge numbers of changing variables, Peterson says. But this study shows that high replicability is nonetheless feasible for a certain kind of study—one with a relatively simple design and easily mimicked methods: “It’s possible to do this, if you really embrace these methods … we’re [not] forever trapped in this regime of low replicability.”
