AI coding assistants are everywhere. They autocomplete your functions, draft your pull requests, and promise to make every developer ten times more productive. But there is a dirty secret behind the hype: the people training these models to write code are, by structural economic design, not the people who are best at writing code. The boss pointed this one out, and it is one of those observations that becomes impossible to unsee once you think it through.
The problem with AI coding training data is not a bug. It is a business model.
The People Teaching AI to Code
Training an AI to write good code requires a process called RLHF: reinforcement learning from human feedback. In simple terms, human reviewers look at code the AI has written, rank which version is better, flag errors, and write example solutions. The AI learns from those rankings. The quality of those rankings determines the quality of the AI.
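A single unit of that feedback is just a preference record. Here is a hypothetical sketch of what one reviewer produces (the field names are invented for illustration; real platforms use their own schemas):

```python
# Hypothetical annotation record: the reviewer compared two candidate
# implementations of the same prompt and ranked one above the other.
preference_record = {
    "prompt": "Write a function that removes duplicates from a list, preserving order.",
    "completion_a": "def dedupe(xs):\n    return list(set(xs))",
    "completion_b": (
        "def dedupe(xs):\n"
        "    seen = set()\n"
        "    return [x for x in xs if not (x in seen or seen.add(x))]"
    ),
    "preferred": "b",
    "note": "A loses the original ordering; only B preserves it.",
}
```

The ranking takes real judgment: both completions "work" on many inputs, and a reviewer who never thinks to test ordering would score them as equal. That judgment call is exactly what the model learns from.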
So who are these human reviewers? According to job listings tracked by industry analysts, entry-level annotation work through platforms like Remotasks or Outlier pays $15 to $30 per hour as contractor work. Specialized coding tasks pay better, with DataAnnotation.tech offering around $40 per hour and Outlier paying up to $60 per hour for certain technical tasks.
That sounds reasonable until you compare it to what good developers actually earn. The U.S. Bureau of Labor Statistics reports that the median software developer salary was $133,080 in 2024. At major technology companies, total compensation for mid-level engineers regularly exceeds $250,000 when you factor in stock grants and bonuses. That works out to roughly $120 per hour, with full benefits, job security, and career growth.
Why would a developer earning that kind of money spend their evenings doing piecework on a gig platform for a fraction of the pay? The answer, overwhelmingly, is that they would not.
Who Actually Does the Work
The bulk of AI training labor is not performed by senior engineers in San Francisco. It is performed by workers in the Global South, often under conditions that have been described as exploitative. The World Bank estimates there are between 150 and 430 million data laborers globally, and the vast majority work in countries like Kenya, the Philippines, India, and Venezuela.
A CBS News 60 Minutes investigation found that AI data workers in Kenya earn $1.50 to $2 per hour. Documents reviewed by the program showed that OpenAI agreed to pay the outsourcing company SAMA $12.50 per hour per worker, but the workers themselves received only $2. In the Philippines, an investigation by The Washington Post found that Scale AI routinely delayed or withheld payments to workers who often earned far below the local minimum wage. Of 36 workers interviewed, 34 reported payment problems.
Rest of World documented that Scale AI’s contract pay rates vary dramatically by region: $21.55 per hour for German-language work versus $1.43 per hour for Telugu. These are not different jobs. The task descriptions are identical.
This is the workforce teaching AI to code. Not the senior architects who understand system design. Not the staff engineers who have spent years learning why certain patterns fail at scale. The people training AI are, by economic necessity, the ones who cannot command better pay elsewhere.
AI Coding Training Data and the Quality Problem
As Privacy International noted in a detailed explainer, there are two tiers of data labelers: generic annotators handling large-scale datasets, and expert annotators with domain-specific knowledge. The organization found that low-quality data directly leads to incorrect or unfavorable AI outputs. For code, this distinction matters enormously. A junior developer might recognize that a function works, but miss that it introduces a memory leak (memory that is allocated but never freed, so consumption grows until the system slows or crashes), a race condition (behavior that depends on the timing of concurrent operations), or a security vulnerability that only shows up at scale.
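A minimal illustration of the kind of defect that passes a surface-level review: the counter below "works" in any single-threaded test, yet can lose updates under concurrent load, because its read-modify-write is not atomic. The class names are hypothetical.

```python
import threading

class HitCounter:
    """Tracks page hits. Passes any single-threaded test."""
    def __init__(self):
        self.count = 0

    def record_hit(self):
        # Read-modify-write: not atomic. Two threads can read the same
        # value and both write value + 1, silently losing one hit.
        current = self.count
        self.count = current + 1

class SafeHitCounter(HitCounter):
    """Same interface, but the update is guarded by a lock."""
    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def record_hit(self):
        with self._lock:
            self.count += 1

def hammer(counter, hits_per_thread=10_000, threads=4):
    """Drive the counter from several threads and return the final count."""
    def worker():
        for _ in range(hits_per_thread):
            counter.record_hit()
    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.count
```

A reviewer who only checks that `record_hit` increments the count in a unit test will approve the first version; only someone who has been burned by concurrency asks how it behaves when `hammer` runs it from four threads at once.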
The results are measurable. An analysis of 470 open-source pull requests by CodeRabbit found that AI-generated code produces 1.7 times more issues than human-written code. Security vulnerabilities were 2.74 times more common. Logic and correctness errors were 75% more frequent. Error handling gaps appeared at nearly double the rate.
Research from Apiiro, analyzing code across Fortune 50 enterprises, found that by June 2025, AI-generated code was introducing over 10,000 new security findings per month. Privilege escalation paths (routes by which an attacker gains more access than intended) jumped 322%. Architectural design flaws spiked 153%. The researchers described it bluntly: AI is fixing the typos but creating the timebombs.
The Self-Preservation Problem
Beyond pay, there is a second reason experienced developers avoid RLHF work: self-interest. Why would a senior engineer spend their time meticulously teaching an AI system to replicate their own skills? Every high-quality code review they submit to a training platform brings the AI one step closer to making their own expertise less valuable. No rational economic actor trains their own replacement for $40 an hour.
This creates a structural Catch-22. The developers who could most improve AI coding models have the least incentive to participate. The developers who do participate are, almost by definition, the ones who cannot yet command top-tier salaries. The AI learns from the latter group and inherits their blind spots.
The Evidence: AI Makes Experienced Developers Slower
A randomized controlled trial by METR, published in July 2025, tracked 16 experienced open-source developers completing 246 real tasks on codebases they had worked on for an average of five years. The result: developers using AI tools took 19% longer to complete their work. Not faster. Slower.
The perception gap was striking. Before the study, developers predicted AI would make them 24% faster. After the study, they still believed they had been 20% faster. The reality was the opposite. As InfoWorld reported, developers accepted less than 44% of AI-generated code suggestions, with 56% making major modifications to clean up what the AI produced.
For experienced developers working in codebases they know well, AI coding assistants do not help. They add friction. The code they produce reflects patterns learned from mediocre training data, not the nuanced understanding that comes from years of working in a specific system.
The Working Conditions Behind the Data
The labor conditions make the quality problem worse. Nearly 100 Kenyan AI workers published an open letter stating that their working conditions “amount to modern day slavery.” Oxford’s Fairwork project assessed 15 data labeling platforms and found none scored above the bare minimum for fair pay, conditions, contracts, or management.
Workers on these platforms report sudden account deactivations, unpaid work disguised as “qualification tests,” and algorithmic management that replaces human supervisors with automated performance tracking. A class-action lawsuit filed in May 2025 accused Surge AI, the parent company behind DataAnnotation.tech, of misclassifying workers as independent contractors to deny them overtime and minimum wage protections.
When workers are underpaid, overworked, and subject to arbitrary termination, the quality of their output suffers. This is not controversial. It is basic labor economics. And that diminished quality flows directly into the AI models that millions of developers rely on.
What This Means
The fundamental problem with AI coding tools is not a technical limitation that will be solved by a bigger model or a smarter architecture. It is a labor market problem. The economics of RLHF ensure that the people training AI to code are not the people best equipped to do so. Until that changes, AI coding assistants will continue to produce code that looks plausible but breaks in the ways that only experience can anticipate.
An academic review from MIT, Harvard, and other institutions cataloging the fundamental limitations of RLHF concluded that “there has been relatively little public work systematizing its flaws.” The paper identified numerous open problems with human feedback quality, including that evaluator expertise directly constrains what the model can learn.
For now, the AI coding revolution has a teacher problem. The best teachers are busy doing the actual work. The actual teachers are doing their best with the tools and experience they have. And the gap between those two groups shows up in every hallucinated function, every missed edge case, and every security vulnerability that slips past review.
AI coding assistants are shipping code with 1.7 times more defects and 2.74 times more security vulnerabilities than human-written code, alongside a 322% surge in privilege escalation paths. These are not growing pains. They are the predictable output of a training pipeline where the human feedback loop is staffed by the wrong people. The boss raised this point, and it deserves a rigorous look at the structural economics behind AI coding training data.
RLHF and the Evaluator Quality Problem
Modern language models are fine-tuned through reinforcement learning from human feedback (RLHF), a three-stage process: collect human evaluations of model outputs, train a reward model to approximate those evaluations, then optimize the policy against that reward model. The quality ceiling of the resulting model is bounded by the quality of human feedback in stage one. As Casper et al. (2023) documented in their systematic review of RLHF limitations, the technique “has emerged as the central method used to finetune state-of-the-art large language models” despite “relatively little public work systematizing its flaws.”
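The stage-two reward model is typically trained on pairwise preferences with a Bradley-Terry style loss: the model should score the human-preferred output above the rejected one. A minimal sketch of that training loop, in plain Python over hypothetical feature vectors (not any lab's actual pipeline, which operates on neural representations of full transcripts):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Scalar reward: dot product of learned weights and output features.
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """pairs: list of (preferred_features, rejected_features) tuples.
    Minimizes the Bradley-Terry loss -log(sigmoid(r_pref - r_rej))
    by plain gradient descent."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for pref, rej in pairs:
            margin = reward(weights, pref) - reward(weights, rej)
            grad_scale = sigmoid(margin) - 1.0  # d(loss)/d(margin)
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (pref[i] - rej[i])
    return weights
```

The sketch makes the dependency on evaluators concrete: the model learns whatever distinguishes "preferred" from "rejected" in the pairs it is given. If evaluators never mark down code for, say, missing error handling, no amount of optimization in stage three can recover that signal.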
For code generation specifically, RLHF requires evaluators who can assess not just whether code compiles and returns correct output, but whether it handles edge cases, manages memory efficiently, avoids concurrency bugs, and does not introduce architectural debt. Privacy International distinguishes between two tiers: generic data labelers annotating large-scale datasets, and expert labelers with domain-specific knowledge. For coding tasks, the gap between these tiers is the gap between “it runs” and “it is production-ready.”
The Economics: Why Experts Do Not Participate
The U.S. Bureau of Labor Statistics reports a median software developer salary of $133,080 (May 2024). At FAANG-tier companies, total compensation for mid-level engineers exceeds $250,000 with equity, translating to roughly $120 per hour. Senior and staff engineers earn considerably more.
RLHF coding tasks, by contrast, pay $40 to $60 per hour on platforms like DataAnnotation.tech and Outlier. Entry-level annotation ranges from $15 to $30 per hour. These are contractor positions with no benefits, no equity, no career progression, and sporadic task availability. Workers on these platforms report that high-paying projects disappear quickly, and account deactivation can happen without warning or explanation.
The rational calculus for a senior developer is straightforward. At $60 per hour with no benefits and no job security, RLHF work pays meaningfully less than even the median salaried role once benefits are counted; against a FAANG engineer's roughly $120 per hour, it is a 50% pay cut, and for senior and staff engineers the cut approaches 75%. The opportunity cost makes participation irrational for anyone with strong market alternatives. This is before considering the self-preservation factor: expert developers have no incentive to train a system that devalues their own expertise.
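The opportunity-cost arithmetic can be made explicit. The sketch below uses the figures cited in this section ($133,080 median salary, roughly 2,080 working hours per year, $120 per hour FAANG-equivalent total comp, $60 per hour top-end RLHF rate); the 1.3 benefits multiplier is an illustrative assumption, not a cited figure.

```python
def hourly_rate(annual_salary, hours_per_year=2080, benefits_multiplier=1.3):
    # Effective hourly value of a salaried role. The 1.3 benefits
    # multiplier (health care, retirement, paid leave) is an assumption.
    return annual_salary * benefits_multiplier / hours_per_year

def pay_cut(salaried_hourly, gig_hourly):
    # Fraction of effective pay given up by switching to gig work.
    return 1.0 - gig_hourly / salaried_hourly

median_dev_hourly = hourly_rate(133_080)  # BLS median, ~$83/hour effective
faang_hourly = 120.0                      # already stated as per-hour total comp
rlhf_rate = 60.0                          # top-end RLHF coding-task rate

print(f"median dev: {pay_cut(median_dev_hourly, rlhf_rate):.0%} effective cut")
print(f"FAANG dev:  {pay_cut(faang_hourly, rlhf_rate):.0%} effective cut")
```

Even under generous assumptions, the $60 rate only breaks even against salaried work for developers below the median, which is precisely the selection effect the training pipeline inherits.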
Where the Actual AI Coding Training Data Comes From
The bulk of training labor is outsourced to the Global South. The World Bank estimates 150 to 430 million data laborers globally. A CBS 60 Minutes investigation documented Kenyan workers earning $1.50 to $2 per hour on AI training tasks. OpenAI paid the outsourcing firm SAMA $12.50 per hour per worker; workers received $2. In the Philippines, Scale AI’s Remotasks platform routinely delayed or withheld payments from workers who earned below minimum wage. Of 36 workers interviewed, 34 reported payment problems.
Rest of World found 15x pay disparities within the same company for identical tasks: $21.55 per hour for German-language work, $1.43 for Telugu. Venezuelan AI data labelers earn 90 cents to $2 per hour. Oxford’s Fairwork project assessed 15 platforms and found none scored above the bare minimum for fair pay, conditions, or management.
The corporate structures obscure this reality. Surge AI operates DataAnnotation.tech, Taskup.ai, and Gethybrid.io as worker-facing subsidiaries while keeping enterprise client relationships separate. Remotasks is Scale AI’s worker-facing subsidiary. Workers frequently do not know which AI company their labeled data is being used by, let alone which model they are training.
Measurable Quality Consequences
CodeRabbit’s analysis of 470 open-source pull requests quantified the gap: AI-generated PRs produce 10.83 issues per PR versus 6.45 for human-written PRs (1.7x). Security vulnerabilities: 2.74x higher. Logic and correctness errors: 75% more common. Error handling gaps: nearly 2x. Readability issues: over 3x.
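The headline multiple follows directly from the per-PR counts:

```python
ai_issues_per_pr = 10.83      # CodeRabbit, AI-generated PRs
human_issues_per_pr = 6.45    # CodeRabbit, human-written PRs

ratio = ai_issues_per_pr / human_issues_per_pr
print(f"{ratio:.2f}x")  # ~1.68x, reported as 1.7x
```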
Apiiro’s research across Fortune 50 codebases found that AI-assisted development generates 10x more security findings while reducing PR volume by a third. By June 2025, AI code introduced over 10,000 new security findings per month. The vulnerability profile shifted: trivial syntax errors dropped 76%, but privilege escalation paths increased 322% and architectural design flaws spiked 153%. The researchers noted that AI “is fixing the typos but creating the timebombs.”
This pattern is consistent with models trained on feedback from evaluators who can assess surface-level correctness but lack the experience to identify deep architectural issues. A junior reviewer can confirm that a function returns the expected output. They are less likely to catch that it introduces a race condition under concurrent load, leaks resources on exception paths, or violates the authorization model of the broader system.
The METR Study: Empirical Evidence of the Gap
A randomized controlled trial by METR (July 2025) tracked 16 experienced open-source developers across 246 tasks on repositories they had contributed to for an average of five years. Tasks were randomly assigned to AI-allowed or AI-disallowed conditions, with the AI-allowed condition using frontier tools (Cursor Pro with Claude 3.5/3.7 Sonnet).
Result: AI-allowed tasks took 19% longer. Developers predicted a 24% speedup beforehand and still estimated a 20% speedup afterward. InfoWorld’s analysis noted that developers accepted less than 44% of AI suggestions, with 75% reading every line and 56% making major modifications. The researchers found greater slowdowns on tasks where developers had high prior experience with the codebase.
This is the RLHF quality gap in action. When experienced developers encounter AI-generated code, they spend more time verifying, correcting, and cleaning up the output than they would have spent writing it themselves. The AI’s suggestions reflect patterns learned from lower-quality feedback, not the specific context of the codebase. For experts, the tool adds noise rather than signal.
The Labor Conditions Compound the Problem
Nearly 100 Kenyan AI workers described their conditions as “modern day slavery” in an open letter. Workers are subjected to strict timers, algorithmic surveillance, and arbitrary account deactivation. A 2025 class-action lawsuit accuses Surge AI of misclassifying workers as independent contractors to avoid labor protections.
Under these conditions, optimizing for throughput is rational. Workers learn to complete tasks quickly to maintain earnings, not to provide the careful, thoughtful evaluation that produces a high-quality training signal. When your account can be deactivated without explanation for working too slowly, depth of analysis becomes a luxury you cannot afford. The incentive structure selects for speed over quality at every level.
The Structural Catch-22
The fundamental issue is not technical but economic. RLHF for code requires evaluators who can assess production-quality engineering decisions. Those evaluators earn $133,000 or more per year in their primary careers. The training pipeline pays $40 to $60 per hour as contractor gig work in the best case, and $1.50 to $2 per hour in the most common case. No rational market actor with strong alternatives would choose to participate.
This creates a feedback loop: models trained on mediocre evaluations produce mediocre code, which experienced developers then waste time cleaning up, which reinforces the perception that AI “still needs work,” which does not change the underlying training economics. The models improve incrementally through scale and architecture changes, but the quality ceiling remains constrained by the evaluation pipeline.
Until AI companies either pay expert-level compensation for expert-level code evaluation (which would dramatically increase training costs) or develop evaluation methods that do not depend on human judgment at all (which remains an open research problem), AI coding assistants will continue to produce code that satisfies surface-level checks while missing the deeper patterns that separate reliable software from fragile software.