The boss nudged us toward a topic that sits at the intersection of epistemology and public policy: what happens to trend data when the underlying measurement apparatus changes? The short answer is that data collection bias generates phantom signals, and those signals drive real-world decisions.
Data collection bias, sometimes called measurement bias or ascertainment bias depending on context, arises when systematic changes in how data are gathered create the appearance of changes in the phenomenon being measured. It is distinct from random error, which scatters data points unpredictably. Systematic bias pushes them consistently in one direction, and it tends to compound over time.
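The distinction is easy to see in a minimal simulation sketch (all numbers invented): a quantity that is truly flat, measured once with random error and once with a systematic offset that switches on when the "method" changes in 2005. Only the second series shows a trend.

```python
# Minimal sketch: random error vs. systematic bias on a truly flat quantity.
# All values are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1990, 2020)
true_value = np.full(len(years), 10.0)          # reality: no trend at all

random_error = true_value + rng.normal(0, 0.5, len(years))
biased = true_value + np.where(years >= 2005, 1.5, 0.0)  # "method change" in 2005

# Random error scatters around the truth; its fitted slope is near zero.
# The systematic shift manufactures a spurious positive trend.
print("slope, random-error series:", np.polyfit(years, random_error, 1)[0].round(3))
print("slope, biased series:      ", np.polyfit(years, biased, 1)[0].round(3))
```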
What follows is a cross-domain survey of how this plays out in practice, drawn from climate science, epidemiology, oncology, and criminal justice.
Data Collection Bias in Autism Prevalence: Diagnostic Substitution and Accretion
The CDC’s Autism and Developmental Disabilities Monitoring Network has tracked autism prevalence since 2000, when the rate stood at approximately 1 in 150 eight-year-olds. The most recent data puts it at 1 in 31, a nearly fivefold increase in prevalence over two decades.
The diagnostic landscape has changed at least as dramatically. The DSM-III (1980) introduced “infantile autism” as a category distinct from childhood schizophrenia. The DSM-III-R (1987) broadened criteria across three domains: social interaction, communication, and restricted activities. The DSM-IV (1994) added Asperger’s disorder. The DSM-5 (2013) consolidated everything under “autism spectrum disorder” and, critically, removed the exclusion criterion that had prevented co-diagnosis of autism and ADHD.
Two rigorous studies have attempted to quantify the contribution of these changes. Hansen et al. (2015), published in JAMA Pediatrics, used a population-based cohort of 677,915 Danish children born between 1980 and 1991. Using a stratified Cox proportional hazards model with diagnostic changes modeled as time-dependent covariates, they found that 60% of the increase in ASD prevalence (95% CI: 33%-87%) could be attributed to two reporting changes: a shift in diagnostic criteria in 1994 and the inclusion of outpatient contacts in 1995.
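The paper's actual model is far more elaborate, but the general technique, treating a rule change as a time-dependent covariate in a Cox model, can be sketched with the Python lifelines library. Everything below (cohort size, hazards, switch ages, column names) is synthetic and purely illustrative, not a reproduction of the Hansen et al. analysis.

```python
# Schematic sketch: a diagnostic-criteria change as a time-dependent covariate
# in a Cox model, using lifelines' counting-process (start/stop) format.
import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

rng = np.random.default_rng(1)
H0, BETA = 0.02, 1.0   # baseline diagnosis hazard; assumed log-hazard-ratio of the rule change
rows = []
for i in range(500):
    # The age at which the (hypothetical) criteria change reaches a child
    # depends on birth year, so it varies across the cohort.
    switch = rng.uniform(2.0, 9.0)
    t1 = rng.exponential(1.0 / H0)             # event time under the old criteria
    if t1 < switch:
        rows.append(dict(id=i, start=0.0, stop=t1, broad_criteria=0, event=1))
        continue
    rows.append(dict(id=i, start=0.0, stop=switch, broad_criteria=0, event=0))
    t2 = switch + rng.exponential(1.0 / (H0 * np.exp(BETA)))  # memoryless restart, higher hazard
    rows.append(dict(id=i, start=switch, stop=min(t2, 10.0),
                     broad_criteria=1, event=int(t2 < 10.0)))

df = pd.DataFrame(rows)
ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()   # coefficient on broad_criteria should recover roughly BETA
```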
King and Bearman (2009), working with 7,003 California DDS patient records, introduced a useful distinction between diagnostic substitution (patient goes from diagnosis X to diagnosis Y) and diagnostic accretion (patient keeps diagnosis X and acquires Y as a comorbidity). They found that 26.4% of the autism caseload increase between 1992 and 2005 was uniquely attributable to individuals previously diagnosed with intellectual disability who subsequently acquired an autism diagnosis, with odds of diagnostic change elevated 1.55 to 1.82 times during periods when diagnostic practices shifted.
A further piece of evidence: autism prevalence varies enormously by state, from 53.1 per 1,000 in California to 9.7 per 1,000 in Texas. The CDC’s own report attributes this primarily to differences in screening intensity, not to real differences in underlying prevalence.
This does not rule out a genuine increase. Older parental age, air pollution exposure during pregnancy, and improved neonatal survival for premature infants are all plausible contributors. But the signal is inseparable from the noise of diagnostic change without careful statistical decomposition.
Sea Surface Temperature: Buckets, Engine Intakes, and Data Truncation
The global sea surface temperature (SST) record, maintained by NOAA as the Extended Reconstructed SST (ERSST) dataset, extends back to 1854 and draws on over 155 million observations from ships of every description. The challenge is that measurement methods have changed radically over that period, and each method carries its own systematic bias.
Before World War II, most SST measurements used bucket sampling. The transition from wooden to canvas buckets around the late 19th century introduced a cold bias because canvas buckets lose heat to evaporation faster. Harvard researchers quantified this: a canvas bucket left on deck for three minutes under typical conditions reads approximately 0.5 degrees Celsius colder than a wooden bucket. After the war, engine-room intake measurements became dominant, introducing a warm bias of 0.1 to 0.5 degrees Celsius from engine heat.
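A hedged sketch of what such a correction looks like in code follows. The offsets are the rough magnitudes quoted above (with a midpoint taken for the engine-intake range), not values from any published correction scheme; real adjustments vary with weather, exposure time, and ship type.

```python
# Illustrative correction of a spliced SST series whose measurement method
# changes over time. Offsets are rough magnitudes from the text, not a
# published correction scheme.
METHOD_OFFSET_C = {
    "wooden_bucket": 0.0,    # treated as the reference here
    "canvas_bucket": -0.5,   # evaporative cooling: reads too cold
    "engine_intake": +0.3,   # engine heat: reads too warm (midpoint of 0.1-0.5)
}

def correct_sst(raw_temp_c: float, method: str) -> float:
    """Remove the method-specific systematic bias from a raw reading."""
    return raw_temp_c - METHOD_OFFSET_C[method]

readings = [(15.1, "canvas_bucket"), (15.0, "canvas_bucket"), (16.0, "engine_intake")]
for raw, method in readings:
    print(f"{method}: {raw} -> {correct_sst(raw, method):.2f}")
```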
A particularly revealing case comes from the Pacific. Chan et al. (2019), published in Nature, found that Japanese fishing vessel records before 1932 were recorded in whole-degree Fahrenheit, converted to Celsius, then rounded. When Japanese naval records from the late 1930s were digitized by the U.S. Air Force, the Celsius decimals were truncated entirely. This created an apparent rapid cooling in the Pacific between 1935 and 1941 that was purely a data-processing artifact.
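The arithmetic of the truncation artifact is easy to reproduce. Converting whole-degree Fahrenheit readings to Celsius leaves fractional parts that truncation always discards downward, so the stored series acquires a systematic cold bias of roughly 0.4 to 0.5 degrees Celsius on average. A short sketch with made-up readings:

```python
# Numerical sketch of the truncation artifact: whole-degree Fahrenheit
# readings converted to Celsius, then stored with the decimals chopped off.
def f_to_c(f: float) -> float:
    return (f - 32.0) * 5.0 / 9.0

whole_deg_f = range(40, 90)            # plausible whole-degree SST readings
exact = [f_to_c(f) for f in whole_deg_f]
truncated = [int(c) for c in exact]    # decimals discarded during digitization

bias = sum(t - e for t, e in zip(truncated, exact)) / len(exact)
print(f"mean bias from truncation: {bias:+.2f} C")   # about -0.44 C
```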
In the North Atlantic, German ships dominated data collection in the late 1920s. Their measurements, concentrated in a single data “deck,” ran systematically warmer than those of nearby ships, creating an apparent regional warming that was partly measurement bias.
A 2024 Nature study by Dobrynin et al. confirmed a broader pattern: SST records from 1900 to 1930 carry an average cold bias of approximately 0.26 degrees Celsius relative to independently reconstructed land-based temperatures. Correcting this bias does not change estimates of total warming since the mid-19th century, but it significantly alters the shape of the warming curve: the early 20th century warming becomes more gradual, decadal-scale variability decreases, and the agreement between model simulations and observations improves.
Separately, a 2011 GAO audit found that 42% of active U.S. Historical Climatology Network stations did not meet NOAA’s own siting standards, which require distance from paved surfaces and obstructions. However, NOAA’s own analysis has found that poorly sited stations do not significantly bias long-term temperature trends after adjustments, a point worth noting for intellectual honesty.
Cancer Epidemiology: Lead-Time Bias, Overdiagnosis, and the Will Rogers Phenomenon
Medical screening introduces multiple layers of data collection bias simultaneously, which is why oncology has produced some of the clearest examples of measurement artifacts being mistaken for real trends.
Lead-time bias occurs when screening advances the date of diagnosis without changing the date of death. As the NCI explains: a man diagnosed with lung cancer at 67 who dies at 70 has a 5-year survival rate of 0%. Screen the same man at 60, and he still dies at 70, but 5-year survival is now 100%. No life has been extended; only the observation window shifted.
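The effect is simple enough to verify with a toy simulation in which every number is invented: the same patients, the same death ages, and only the diagnosis date moves.

```python
# Toy simulation of lead-time bias: identical death ages under two detection
# regimes. Five-year survival "improves" though no one lives a day longer.
import numpy as np

rng = np.random.default_rng(2)
onset = rng.uniform(58, 62, size=1000)     # age when disease begins
death = onset + 10.0                       # everyone dies 10 years after onset

dx_symptomatic = onset + 7.0               # diagnosed when symptoms appear
dx_screened = onset                        # diagnosed at onset via screening

surv_late = np.mean(death - dx_symptomatic >= 5)
surv_early = np.mean(death - dx_screened >= 5)
print(f"5-yr survival, symptomatic dx: {surv_late:.0%}")   # 0%
print(f"5-yr survival, screen dx:      {surv_early:.0%}")  # 100%
```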
Overdiagnosis is the extreme case: screening detects cancers so slow-growing they would never cause symptoms during the patient’s lifetime. Dunn et al. (2022), writing in the Journal of the National Cancer Center, review the evidence across multiple cancer types. Studies estimate 19% of screen-detected breast cancers and 20% to 50% of screen-detected prostate cancers are overdiagnosed. The neuroblastoma screening program in Japan detected tumors that spontaneously regressed, leading to its termination after population-based screening failed to reduce mortality.
South Korea provides the most striking case study. After the government launched a national cancer screening program in 1999, fee-for-service providers added thyroid ultrasound. By 2011, thyroid cancer diagnoses had increased fifteenfold over 1993 levels. Mortality was unchanged. A nationwide study found that 94.4% of the increase consisted of tumors smaller than 2 centimeters, and 97.1% were local or regional cancers. After public awareness campaigns, thyroid operations dropped 35% in a single year.
The Will Rogers phenomenon, described by Feinstein et al. in 1985, adds another layer: when improved diagnostics reclassify patients from lower stages to higher stages, survival statistics for both stages appear to improve, even though no patient is actually better off. The name comes from Will Rogers’ quip that when Okies left Oklahoma for California, they raised the average intelligence of both states.
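The arithmetic can be shown in a few lines. In this invented example, better imaging reclassifies the sickest "early stage" patient as late stage; both group averages improve while every individual outcome is unchanged.

```python
# Tiny numeric illustration of the Will Rogers phenomenon. Survival times
# in years; all numbers invented.
early = [10, 9, 8, 2]   # mean 7.25; the 2-year patient is occult late-stage
late = [1, 1, 1]        # mean 1.0

migrated = early.pop(3)  # improved staging moves the 2-year patient
late.append(migrated)

print("early-stage mean:", sum(early) / len(early))  # 9.0  (was 7.25)
print("late-stage mean: ", sum(late) / len(late))    # 1.25 (was 1.0)
# Both stage-specific averages rose; no patient's survival changed.
```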
Data Collection Bias in Crime Statistics: The NIBRS Transition
The FBI’s 2021 transition from the Summary Reporting System (SRS) to the National Incident-Based Reporting System (NIBRS) illustrates a textbook case of methodological discontinuity.
Under SRS, a single incident was classified by its most serious offense (the “hierarchy rule”). A robbery-assault-theft incident counted as one robbery. Under NIBRS, all offenses in an incident are recorded, up to ten per event. NIBRS also captures offense types that SRS did not track at all, including intimidation, human trafficking, and kidnapping.
The immediate statistical consequence is that NIBRS-reported crime counts can be higher than SRS counts for identical levels of criminal activity. The FBI’s own 2015 report acknowledged that agencies feared “the public, media, and government officials will misinterpret the apparent change in crime and attribute the increased crime counts to failed policing administration and leadership rather than a change in how the crime data are being reported.”
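A minimal sketch of why the two series are incommensurable, counting the same invented incidents under both rules (the severity ranking here is illustrative, not the real SRS hierarchy or NIBRS offense codes):

```python
# Counting identical criminal activity under the SRS hierarchy rule vs.
# NIBRS incident-based recording. Severity ranking and incidents invented.
SEVERITY = {"homicide": 0, "robbery": 1, "assault": 2, "theft": 3}  # lower = more serious

incidents = [
    ["robbery", "theft", "assault"],   # one incident, three offenses
    ["theft"],
    ["assault", "theft"],
]

srs_count = len(incidents)  # hierarchy rule: one (most serious) offense per incident
nibrs_count = sum(len(offenses) for offenses in incidents)  # every offense logged

srs_top = [min(offenses, key=SEVERITY.get) for offenses in incidents]
print("SRS   :", srs_count, "offenses counted ->", srs_top)   # 3
print("NIBRS :", nibrs_count, "offenses counted")             # 6
```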
Compounding the problem, the 2021 transition was incomplete. Population coverage dropped from approximately 95% to 65% because many agencies had not converted. This created a simultaneous change in both methodology and sampling frame, making 2020-to-2021 comparisons essentially meaningless for any rigorous purpose.
The Pattern Across Domains
The common thread is straightforward: when measurement changes, trend data breaks. The specific mechanisms vary:
- Diagnostic expansion: broader criteria capture cases that always existed (autism, ADHD)
- Instrument substitution: new tools carry different systematic biases (buckets to engine intakes, Stevenson screens to automated weather stations)
- Screening intensity: looking harder finds more of what was always there (thyroid cancer, neuroblastoma)
- Reporting methodology: counting rules change what is visible in aggregated data (SRS to NIBRS, the official poverty measure to the Supplemental Poverty Measure)
- Data processing artifacts: digitization, rounding, truncation introduce errors invisible without forensic analysis (Japanese SST records)
In each case, the corrective is the same: compare like with like. Scientists correcting the SST record do this by analyzing near-crossings between ships. Epidemiologists do it by modeling diagnostic changes as time-dependent covariates. The general public, facing a headline about a dramatic trend, can ask three questions:
- Did the definition of what we are measuring change?
- Did the instrument or method of measurement change?
- Did the intensity or scope of measurement change?
If the answer to any of these is yes, the trend line encodes two signals: one from reality, one from the measurement apparatus. Separating them is painstaking statistical work. Ignoring the distinction is how phantom epidemics are born.



