
The Replication Crisis: Why Most Published Psychology Research Cannot Be Reproduced

Mar 13, 2026

In 2011, a respected social psychologist named Daryl Bem published a paper in one of psychology’s most prestigious journals claiming to have found evidence of precognition. Nine experiments, eight statistically significant results, all suggesting that humans can perceive the future. The paper passed peer review. The statistics checked out. The methodology followed every convention the field demanded. It was a preview of the replication crisis, the ongoing problem of published findings that cannot be reproduced by independent researchers, which would soon force the entire discipline to confront how its methods could produce, validate, and publish evidence for something that almost certainly is not real.

The problem was not that Bem was a fraud. The problem was that the conventions themselves were broken. When other researchers tried to replicate his results, they found nothing. And when they submitted their null findings for publication, the same journal that published the original paper rejected them, on the grounds that it did not publish replications.

This episode did not cause the replication crisis in psychology. It revealed it. The system that could produce, validate, and publish evidence of psychic powers while simultaneously refusing to publish the disconfirmation of those powers had a structural problem that went far deeper than any single study.

What the Reproducibility Project Actually Found

In 2015, the Open Science Collaboration, led by University of Virginia psychologist Brian Nosek, published the results of the largest systematic replication attempt in the history of psychology. The team selected 100 studies from three leading psychology journals, all published in 2008, and attempted to reproduce them using the original methods, materials, and (where possible) consultation with the original authors.

The results were blunt. Of the original 100 studies, 97 had reported statistically significant findings. Only 36 of the replications achieved statistical significance. The average effect size in the replications, a standardized measure of the magnitude of a difference that does not depend on sample size, was roughly half that of the originals. When the replication teams were asked for a subjective assessment of whether the original result had been reproduced, only 39% said yes.

These numbers deserve context. A 36% replication rate does not mean that 64% of psychology findings are “wrong” in the everyday sense. Some failures may reflect differences in population, setting, or implementation rather than a false original finding. But even accounting for these factors, the gap between what the literature claimed and what careful replication could reproduce was too large to explain away. Something systematic was inflating the results that made it into print.

The Mechanisms: How False Positives Accumulate

To understand why published results are less reliable than they appear, you need to understand three practices that are technically not fraud but produce many of the same consequences. Researchers have given them names: p-hacking (adjusting the analysis until a result crosses the significance threshold), HARKing (hypothesizing after the results are known), and publication bias (the tendency for significant findings to be published while null results disappear). They interact with each other in ways that compound the problem.

P-hacking: making the data fit the hypothesis

Statistical significance in psychology traditionally means achieving a p-value below 0.05, which roughly translates to: “If there were no real effect, there would be less than a 5% chance of seeing results this extreme.” The threshold is arbitrary but universal. Careers, tenure decisions, and journal publications all depend on crossing it.
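
To make that definition concrete, here is a minimal sketch, not from the article, written in Python with NumPy and SciPy: run thousands of two-group experiments in which no real effect exists and count how often p falls below 0.05.

```python
# A minimal sketch: simulate many two-group experiments where the null hypothesis
# is true, and count how often p falls below the 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 10_000, 30

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, n_per_group)  # both groups drawn from the same population,
    b = rng.normal(0, 1, n_per_group)  # so any "effect" found here is pure noise
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"significant results with no real effect: {false_positives / n_experiments:.1%}")
# Prints roughly 5% -- the rate the threshold promises when one planned test is run once.
```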

P-hacking refers to the practice of adjusting your analysis until you cross that threshold. This can take many forms: dropping outliers, adding or removing control variables, testing multiple dependent variables and reporting only the one that worked, collecting more data until the p-value dips below 0.05, or splitting participants into subgroups until one combination produces significance. None of these steps need to involve conscious dishonesty. A researcher who genuinely believes their hypothesis is correct might make each of these decisions for defensible reasons, without recognizing that the cumulative effect is to dramatically inflate the false positive rate.
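To see how those choices compound, here is a rough sketch under assumptions of my own (three outcome measures and one round of optional stopping, numbers chosen for illustration) that extends the simulation above so the “researcher” gets several chances at significance.

```python
# A rough sketch of mild p-hacking: the same null simulation, but the "researcher"
# now measures three outcomes and, if none is significant, collects ten more
# participants per group and tests everything again.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, start_n, extra_n, n_outcomes = 5_000, 20, 10, 3

def any_significant(a, b):
    """True if any outcome column reaches p < 0.05 in a two-sample t-test."""
    return any(stats.ttest_ind(a[:, k], b[:, k]).pvalue < 0.05 for k in range(n_outcomes))

hits = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, (start_n, n_outcomes))  # no real effect on any outcome
    b = rng.normal(0, 1, (start_n, n_outcomes))
    if any_significant(a, b):
        hits += 1
        continue
    # "Optional stopping": add more data and peek at the p-values a second time.
    a = np.vstack([a, rng.normal(0, 1, (extra_n, n_outcomes))])
    b = np.vstack([b, rng.normal(0, 1, (extra_n, n_outcomes))])
    hits += any_significant(a, b)

print(f"false positive rate with two flexible choices: {hits / n_experiments:.1%}")
# Roughly 20% rather than the nominal 5%; stacking more of the choices listed above
# (dropping outliers, subgroups, extra covariates) pushes the rate far higher.
```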

A 2012 survey of over 2,000 psychologists found that 42% admitted to collecting additional data after checking whether results were statistically significant, and 64% admitted to selectively reporting studies that “worked.” A Royal Society analysis catalogued the specific strategies and found that even modest p-hacking could inflate false positive rates from the nominal 5% to above 60%.

HARKing: hypothesizing after the results are known

HARKing stands for Hypothesizing After the Results are Known. It describes the practice of running an experiment, finding an unexpected pattern in the data, and then writing the paper as though you predicted that pattern all along. The introduction presents a clean theoretical justification. The results section delivers a satisfying confirmation. The reader never learns that the researcher tested four different hypotheses and the one in the paper was the only one that reached significance.

The same 2012 survey found that 51% of psychologists admitted to reporting an unexpected finding as though it had been predicted from the start. HARKing transforms exploratory research (which is valuable but uncertain) into confirmatory research (which appears far more convincing). It makes the scientific literature look like a series of successful predictions rather than what it often is: a collection of pattern-matches found after the fact.

Publication bias: the file drawer problem

In 1979, psychologist Robert Rosenthal described what he called the “file drawer problem.” For every study that finds a statistically significant result and gets published, there may be several studies that tested the same hypothesis, found nothing, and were never submitted or never accepted. The published literature therefore systematically overrepresents positive findings.

This is survivorship bias, the error of drawing conclusions from a sample from which the failures have already been removed, applied to scientific knowledge. You see the studies that survived the publication process. You do not see the ones that were abandoned, rejected, or never written up because the results were “not interesting enough.” The surviving studies create an illusion of consistency. If ten labs test the same effect and only the two that found significance publish their results, the literature will show a 100% replication rate for a finding that actually failed 80% of the time.
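
The arithmetic of the file drawer is easy to simulate. The sketch below uses illustrative numbers of my own choosing, not data from Rosenthal or the article: many labs study a genuinely tiny effect with small samples, and only the significant, positive results make it out of the drawer.

```python
# A toy sketch of survivorship in publication: a real but tiny effect, many small labs,
# and a "literature" that only ever sees the significant positive results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n_per_group, n_labs = 0.1, 30, 1_000  # true standardized effect of d = 0.1

published = []
for _ in range(n_labs):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_d, 1.0, n_per_group)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    observed_d = (treated.mean() - control.mean()) / pooled_sd
    if stats.ttest_ind(treated, control).pvalue < 0.05 and observed_d > 0:
        published.append(observed_d)  # everything else stays in the file drawer

print(f"labs whose result 'got published': {len(published)} of {n_labs}")
print(f"true effect: d = {true_d}; average published effect: d = {np.mean(published):.2f}")
# The visible literature agrees on an effect several times larger than the real one.
```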

Publication bias is not a secret conspiracy. It is the predictable outcome of a system where journals want novel, significant findings, reviewers find null results uninteresting, and researchers need publications to keep their jobs. Everyone involved is responding rationally to the incentives in front of them. The result is a scientific record that is systematically misleading.

The Ioannidis Warning

A decade before the Reproducibility Project confirmed the problem empirically, Stanford epidemiologist John Ioannidis published a paper in PLoS Medicine with one of the most provocative titles in scientific history: “Why Most Published Research Findings Are False.” Using a mathematical model that incorporated bias, statistical power, and the ratio of true to false hypotheses in a given field, Ioannidis argued that the majority of published findings were likely to be wrong.

The paper has become one of the most cited in the history of medical literature. Its core insight is straightforward once you see it: in any field where researchers test many hypotheses, most of which are probably false, even a low false positive rate will produce a literature dominated by spurious findings. Add underpowered studies (too few participants to reliably detect real effects), flexible analysis methods, and selective publication, and the math gets worse.
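
The core of that argument fits in a few lines. The sketch below uses illustrative numbers rather than Ioannidis’s full model, which also includes explicit terms for bias and for multiple teams chasing the same question.

```python
# A back-of-envelope sketch of the argument: what fraction of "significant" findings
# reflect real effects, given how many tested hypotheses are true and how powerful
# the typical study is?
def share_true(prior_true: float, power: float, alpha: float = 0.05) -> float:
    """Positive predictive value: fraction of significant results that are real."""
    true_hits = prior_true * power          # true hypotheses correctly detected
    false_hits = (1 - prior_true) * alpha   # false hypotheses that cross p < .05 anyway
    return true_hits / (true_hits + false_hits)

# A hypothetical field where 1 in 10 tested hypotheses is true and studies have 35% power:
print(f"{share_true(prior_true=0.10, power=0.35):.0%} of significant findings are real")
# About 44% -- most published positives would be false even before any p-hacking or
# publication bias enters the picture.
```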

Not everyone agrees with Ioannidis’s most dramatic conclusion. Biostatisticians Jager and Leek estimated the false discovery rate in the top medical literature at roughly 14%, far below the majority that Ioannidis suggested. But even the most generous estimates acknowledge that the published record contains substantially more false positives than its statistical methods would lead you to expect.

The Replication Crisis in Practice: What Fell Apart

Abstract statistics become concrete when you look at specific findings that did not survive replication.

Ego depletion was one of the most influential concepts in social psychology for two decades. The idea: willpower is a finite resource, like a fuel tank. Use self-control on one task and you have less available for the next. By 2010, over 200 published studies supported the concept. Then in 2016, a registered replication report coordinated across 23 laboratories with over 2,100 participants found an effect size of essentially zero (d = 0.04, with confidence intervals spanning from negative to barely positive). The tower of 200 studies had been built on foundations that could not support it.

The ego depletion case is instructive because it is not a story about bad science done by bad people. The original researchers were respected. The studies were well-constructed by the standards of their time. What failed was the system around them: underpowered individual studies, selective publication of positive results, and a theoretical framework so intuitive that it discouraged skepticism.

Power posing, the claim that adopting expansive body postures for two minutes causes hormonal changes and increases risk-taking behavior, became a cultural phenomenon after a 2012 TED Talk by Amy Cuddy that has been viewed over 70 million times. The original 2010 study had 42 participants. Subsequent replication attempts with larger samples could not reproduce the hormonal or behavioral effects. One of the original co-authors, Dana Carney, publicly stated in 2016 that she did not believe the effect was real.

These are not cherry-picked embarrassments. The Dunning-Kruger effect, long cited as proof that incompetent people are uniquely unable to recognize their incompetence, has faced serious statistical critiques suggesting the signature pattern may be largely an artifact of regression to the mean, the statistical tendency of extreme scores on one measure to be less extreme on a related measure, rather than a real cognitive bias. Stereotype threat, priming effects, and the facial feedback hypothesis have all seen significant replication failures or dramatically reduced effect sizes.

Why Psychology Was Hit Hardest

The replication crisis is not exclusive to psychology. Medicine, economics, cancer biology, and political science have all faced their own reckonings. But psychology became the poster child, and there are structural reasons why.

First, psychology studies human behavior, which is sensitive to context in ways that chemistry and physics are not. An effect demonstrated on American undergraduates at a particular university in a particular decade may not generalize to other populations, settings, or time periods. This “context sensitivity” makes replication inherently harder, but it also means that the field needed more rigorous methods, not fewer, to distinguish real effects from noise.

Second, psychology historically ran on small samples. A study with 40 participants divided into two groups has very little statistical power to detect anything but large effects. But if you combine small samples with flexible analysis and publication bias, you get a literature full of statistically significant findings that are actually noise.
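
To put numbers on that claim, here is a quick power calculation, my illustration rather than anything from the article, assuming Cohen’s conventional effect-size benchmarks and the statsmodels package.

```python
# A quick sketch of what "very little statistical power" means for a study with
# 40 participants split into two groups of 20, tested two-sided at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    power = power_calc.power(effect_size=d, nobs1=20, alpha=0.05)
    print(f"{label} effect (d = {d}): {power:.0%} chance of detecting it")

# Roughly 9%, 34%, and 69%: only a large effect is detected more often than not,
# and even that one is missed about a third of the time.
```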

Third, the incentive structure in psychology particularly rewards novelty. A study showing that a surprising intervention changes behavior is more publishable, more citeable, and more likely to generate media coverage than a study showing that it does not. The structural mechanisms behind contradictory research apply across disciplines, but in psychology they interact with a subject matter where surprising findings are especially marketable.

What Has Changed Since 2015

The replication crisis prompted what some researchers call a “credibility revolution,” a suite of structural reforms aimed at making the system more resistant to the problems described above. The progress is real, measurable, and incomplete.

Pre-registration requires researchers to publicly commit to their hypotheses, methods, and analysis plans before collecting data. This makes p-hacking and HARKing harder (though not impossible). The number of pre-registered studies has grown substantially, though a 2025 analysis found that some researchers still deviate from their registered plans without disclosure.

Registered reports go further. Journals review and accept study proposals before the data are collected, committing to publish the results regardless of outcome. Over 300 journals now offer this format. Studies published as registered reports show a dramatically higher rate of null findings compared to traditional publications, which suggests the traditional pipeline was indeed filtering out negative results.

Sample sizes have increased. In social psychology, the median sample size has risen from around 80-100 participants in the early 2010s to approximately 250 today. Larger samples provide more statistical power and more reliable estimates.

Open data and open materials allow other researchers to verify analyses and attempt replications. Many journals now require or encourage data sharing, and platforms like the Open Science Framework provide infrastructure.

These reforms have had measurable effects. Top journals now display stricter evidential standards. The pattern of almost-exclusively-positive findings that characterized the pre-crisis literature has softened. But a 2025 scoping review in Royal Society Open Science found a significant gap: of over 100 studies examining open science interventions, only 15 directly measured whether the interventions actually improved reproducibility. The reforms are plausible, increasingly adopted, and still only partially validated.

What Has Not Changed

The incentive structure of academic science still rewards publication volume, citation counts, and novel findings over replication, rigor, and null results. Researchers who spend their time replicating other people’s work produce fewer “original” papers, which can hurt their careers. Journals that publish replications attract fewer citations than journals that publish novel findings.

Pre-registration can be circumvented. Researchers can pre-register vague or multiple analysis plans, deviate from the registered plan and disclose the deviation in small print, or simply not pre-register their exploratory work and present it as confirmatory. The 2025 article “Campbell’s Law Explains the Replication Crisis” argued that pre-registration badges may become another metric to be gamed, following the pattern where any measure used as a target ceases to be a good measure.

The broader public, meanwhile, continues to encounter psychology findings primarily through media coverage that strips away uncertainty, context, and effect sizes. A study that finds a small, possibly fragile effect becomes a headline claiming that “science proves” some counterintuitive truth about human nature. By the time the replication failure arrives (if it arrives), the original finding is embedded in popular understanding and self-help books.

What This Means for Reading Research

The replication crisis does not mean that psychology is useless or that all published findings are wrong. It means that the published literature is a biased sample of the research that has been conducted, that individual studies (especially those with small samples, surprising findings, and no independent replication) should be treated as preliminary evidence rather than established fact, and that the correction mechanisms are improving but far from complete.

If you are reading a psychology study (or, more commonly, a news article about one), a few questions are worth asking. How large was the sample? Has anyone replicated it? Was it pre-registered? Does the effect size suggest something meaningful, or is it statistically significant but practically tiny? Is the finding reported with appropriate uncertainty, or is it presented as a settled truth?

Science is not broken. But the system that translates scientific work into published knowledge has failure modes that the replication crisis exposed, and fixing those failure modes is a project that is still underway. The most honest thing psychology has done in the last decade is admit that out loud.
