Every person who has used an AI assistant has experienced the same quiet betrayal: you ask for something specific, and the model does something adjacent. Not wrong, exactly. Just… not what you asked for. The boss flagged this one for us, and it turns out the phenomenon has a name, a growing body of research behind it, and implications that reach far beyond a chatbot ignoring your formatting preferences.
The gap between what we tell AI to do and what it actually does is called the LLM instruction following problem, and in 2025, it became one of the most actively studied failures in artificial intelligence.
LLM Instruction Following Under Rigorous Testing
The standard benchmark for instruction following, IFEval, tests 25 types of verifiable constraints: word counts, keyword inclusion, formatting rules. Frontier models now score above 95% on it. GPT-5 hits 95.9%. These numbers are misleading.
In December 2025, researchers from Tsinghua University and Ant Group published IFEval++, which introduced the concept of “nuance-oriented reliability.” The key insight: IFEval tests whether a model can follow a specific prompt. IFEval++ tests whether it can follow “cousin prompts,” rephrased versions of the same instruction that convey identical intent with subtle wording differences.
The results were striking. Across 20 proprietary and 26 open-source LLMs, the new reliable@10 metric (requiring correct execution across 10 cousin prompts simultaneously) revealed performance drops of up to 61.8% compared to standard accuracy. Even GPT-5, the most reliable model tested, saw an 18.3% decline. The sensitivity was granular: changing “at most 600 words” to “at most 610 words” caused widespread failures across the test suite.
The researchers identified three augmentation strategies that exposed these failures: rephrasing (different wording, same constraint), distractor addition (appending compatible but irrelevant constraints), and constraint reconfiguration (same constraint type, different parameters). All three reliably degraded performance, suggesting the problem is not surface-level parsing but fundamental to how models represent and execute instructions.
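The reliable@k idea is simple to state precisely: a model gets credit for an instruction only if it executes every one of the k cousin prompts correctly. A minimal sketch of the metric (toy data with 4 cousins rather than the paper's 10; this is not the IFEval++ harness, just the arithmetic it implies):

```python
def reliable_at_k(results: dict) -> float:
    """results maps each base instruction to pass/fail outcomes across
    its k cousin prompts. An instruction counts only if all k rephrased
    versions were executed correctly."""
    credited = sum(1 for passes in results.values() if all(passes))
    return credited / len(results)

def accuracy(results: dict) -> float:
    """Standard per-prompt accuracy over the same outcomes."""
    flat = [p for passes in results.values() for p in passes]
    return sum(flat) / len(flat)

# Toy outcomes: 3 instructions x 4 cousin prompts each.
outcomes = {
    "word_limit":  [True, True, True, False],  # fails one rephrasing
    "keyword":     [True, True, True, True],
    "format_json": [True, False, True, True],
}
print(accuracy(outcomes))       # 10/12, roughly 0.83
print(reliable_at_k(outcomes))  # 1/3 -- the gap IFEval++ measures
```

A model can look strong on per-prompt accuracy (10 of 12 here) while being reliable on only a third of the underlying instructions, which is exactly the discrepancy the reliable@10 metric exposes.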
The Enterprise Instruction Gap
A separate study from Yellow.ai’s research team, published in December 2025, evaluated 13 models in enterprise RAG scenarios. Each model received persona instructions (behavioral guidelines, formatting rules, tone specifications, content boundaries) alongside retrieval-augmented knowledge snippets, then had to generate compliant responses.
They termed the failure mode the “instruction gap.” Violation counts ranged from 660 to 1,330 across models. The taxonomy of violations included content scope violations (responding outside the designated domain), format violations (deviating from structure or length constraints), tone violations (inconsistencies with prescribed communication style), and procedural violations (failures to follow escalation procedures).
A counterintuitive finding: instruction compliance and response accuracy were not strongly correlated. Models that followed all rules did not necessarily give accurate answers, and accurate models were not necessarily compliant. This suggests that instruction following and knowledge retrieval compete for model attention in RAG contexts, where lengthy knowledge snippets may cause the model to lose focus on compliance requirements.
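That independence is easy to operationalize: compliance can be scored by rule checks that never touch factual accuracy. A minimal sketch of such a checker, loosely mirroring the study's violation categories (the specific rules and the response text are invented for illustration, not Yellow.ai's test set):

```python
import re

# Illustrative persona rules, one per violation category from the
# taxonomy: format, content scope, tone, procedure.
RULES = {
    "format: under 100 words":      lambda r: len(r.split()) < 100,
    "scope: no competitor names":   lambda r: "AcmeRival" not in r,
    "tone: no exclamation marks":   lambda r: "!" not in r,
    "procedure: offer escalation":  lambda r: bool(
        re.search(r"human agent|escalate", r, re.I)),
}

def compliance_report(response: str) -> list:
    """Return the rules this response violates. Nothing here inspects
    whether the answer is factually correct -- accuracy is a separate,
    independent score."""
    return [rule for rule, check in RULES.items() if not check(response)]

resp = ("Our basic plan costs $9/month! For billing disputes, "
        "I can escalate you to a human agent.")
print(compliance_report(resp))  # ['tone: no exclamation marks']
```

The response above could be priced correctly or wildly wrong; the compliance score would not change. That separation is why the two metrics can diverge across models.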
Coding Agents: Context Gaps and Benchmark Fragility
The compliance problem manifests differently in code generation, but the root cause is the same. A 2025 Qodo survey of 609 developers found that 65% report AI misses relevant context during refactoring, test generation, and code review. Context gaps were cited more frequently than hallucinations as the root cause of poor code quality. Meanwhile, 76% of developers do not fully trust AI-generated code.
The experience gap is instructive. Senior developers (10+ years) reported the highest quality benefits from AI (68.2%) but the lowest confidence in shipping without review (25.8%). Junior developers reported the lowest quality improvements (51.9%) but the highest shipping confidence (60.2%). Experienced practitioners understand that what the model produces is only approximately what was requested.
Benchmark performance reinforces this. On SWE-bench Verified, a curated subset of 500 instances, top coding agents score above 70%. On the full, unfiltered benchmark of over 2,000 real-world GitHub issues, scores drop sharply. The gap between “clean benchmark task” and “messy production issue” remains wide.
The practical failure mode was captured well by a developer documenting the “one more prompt” trap. After letting an agentic coding tool accumulate 13,758 lines of additions in a single feature branch, he found that “the reality is that 100% was actually 80%, and I still had to fix the remaining 20% with many single prompts.” Each additional prompt risked breaking what previous prompts had built. The fix was decomposition: smaller PRs, dedicated agent contexts, human review at each step.
Goal Drift in Long-Context Agent Deployments
When agents operate autonomously over extended sessions, instruction compliance degrades in a specific, measurable way. A 2025 study from Apollo Research formalized this as “goal drift”: an agent’s behavioral tendency to deviate from its assigned objective as context accumulates.
The experimental design was rigorous. Agents were placed in a simulated stock trading environment with explicit system-prompt goals (e.g., minimize carbon emissions), then subjected to competing pressures: news articles, stakeholder emails, and distracting questions. The researchers measured drift through portfolio allocation, not self-reports, meaning the evaluation was behavioral rather than introspective.
Four settings were tested: baseline, adversarial pressure, goal switching (temporary instrumental goals before reverting to the original), and combined. All models exhibited drift. The best performer, scaffolded Claude 3.5 Sonnet, maintained near-perfect adherence for over 100,000 tokens but eventually degraded. The mechanism: goal drift correlated with models’ increasing susceptibility to pattern-matching behaviors as context length grew. In practical terms, agents increasingly conform to patterns in their recent context rather than the original system prompt.
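A toy version of that behavioral measurement: adherence is read off what the agent actually holds, not what it says. All asset names and numbers below are invented for illustration; the actual study ran a full trading simulation.

```python
def goal_adherence(allocations: list, green_assets: set) -> list:
    """Per-step fraction of the portfolio held in goal-consistent
    (low-emission) assets. Drift shows up as a decline over time."""
    scores = []
    for step in allocations:
        total = sum(step.values())
        green = sum(v for k, v in step.items() if k in green_assets)
        scores.append(green / total if total else 0.0)
    return scores

history = [
    {"SolarCo": 80, "OilCorp": 20},  # early: strong adherence
    {"SolarCo": 60, "OilCorp": 40},  # competing pressures arrive
    {"SolarCo": 35, "OilCorp": 65},  # drift: recent context dominates
]
adherence = goal_adherence(history, {"SolarCo"})
print(adherence)                      # [0.8, 0.6, 0.35]
drift = adherence[0] - adherence[-1]  # 0.45 of the portfolio drifted
```

The point of scoring allocations rather than asking the agent is the same as in the study: a drifting agent will often still describe its goal correctly while its behavior has already moved.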
This has direct implications for any deployment where agents run for extended periods: coding sessions, customer support chains, research pipelines, infrastructure automation.
Cascading Failures and the OWASP Framework
In multi-agent systems, where several agents each handle different tasks or review one another's outputs, compliance failures compound. Cascading occurs when one agent's deviation is passed downstream, where the next agent amplifies it. Each node in the chain drifts further from the intended outcome, and because each agent optimizes locally, no single agent detects the cumulative error.
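The compounding arithmetic is worth making explicit. If each agent in a chain multiplies the inherited output by a small relative deviation, the end-to-end error grows geometrically with chain length. A toy model (the 5% per-agent figure is an assumption chosen for illustration, not a measured value):

```python
def cascaded_deviation(per_agent_error: float, n_agents: int) -> float:
    """End-to-end relative deviation when each of n agents applies a
    factor of (1 + per_agent_error) to what it inherits. Each agent's
    own contribution looks negligible; the chain's does not."""
    return (1 + per_agent_error) ** n_agents - 1

for n in (1, 5, 10):
    print(n, round(cascaded_deviation(0.05, n), 3))
# 1 agent:  5% off
# 5 agents: ~28% off
# 10 agents: ~63% off
```

This is the simplest possible model of cascading, and real handoffs can behave worse: a downstream agent may treat an upstream error as ground truth and build on it, which is amplification rather than mere accumulation.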
OWASP’s Top 10 for Agentic Applications, published December 2025 with input from over 100 researchers, codifies these risks. The relevant entries for compliance failures:
- ASI01, Agent Goal Hijacking: agents cannot reliably separate instructions from data. Malicious content in emails, PDFs, or RAG documents can redirect agent objectives.
- ASI02, Tool Misuse: ambiguous prompts or misalignment cause agents to call tools with destructive parameters or chain tools in unintended sequences.
- ASI08, Cascading Failures: a small error in one agent propagates across planning, execution, memory, and downstream systems.
- ASI09, Human-Agent Trust Exploitation: users over-trust agent recommendations. Coding assistants can introduce subtle architectural problems that pass review because the developer trusts the model.
- ASI10, Rogue Agents: compromised or misaligned agents that act against instructions while appearing legitimate.
The framework recommends treating each agent as a governed non-human identity with least-privilege access, sandboxed execution, and behavioral monitoring.
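That recommendation can be sketched as a policy layer sitting between the agent and its tools: an allowlist, per-tool argument checks, and an audit log. Everything below is hypothetical scaffolding (the tool names, the policy shape) rather than a real framework API:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-policy")

# Hypothetical per-agent policy: only allowlisted tools exist for this
# agent, and each carries an argument check (least privilege).
POLICY = {
    "search_docs": lambda args: True,
    "send_email":  lambda args: args.get("to", "").endswith("@example.com"),
    # no "delete_records" entry: that capability is simply not granted
}

def guarded_call(tool: str, args: dict, tools: dict):
    check = POLICY.get(tool)
    if check is None:
        log.warning("blocked: %s is not in this agent's allowlist", tool)
        return None
    if not check(args):
        log.warning("blocked: %s called with disallowed args %r", tool, args)
        return None
    log.info("audit: %s(%r)", tool, args)  # behavioral monitoring hook
    return tools[tool](**args)

tools = {"search_docs": lambda query: f"results for {query!r}",
         "send_email": lambda to, body: f"sent to {to}"}
print(guarded_call("search_docs", {"query": "refund policy"}, tools))
print(guarded_call("delete_records", {"table": "users"}, tools))  # None
```

The design choice worth noting is that the policy denies by default: a hijacked agent that invents a tool call outside its grant gets a logged refusal, not an execution.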
Why Benchmarks Missed This
Part of the problem is that the field has been measuring the wrong things. As the Surge AI team pointed out in their analysis of the AdvancedIF benchmark they built with Meta: “The set of ‘instructions humans actually give’ and the set of ‘instructions verifiable in Python’ have almost no overlap.”
IFEval tests constraints like “refrain from the use of any commas” and “the letter e should appear exactly 14 times.” These are programmatically verifiable but have nothing to do with real-world instruction following. A model could produce incoherent nonsense and score perfectly as long as it avoids commas. Meanwhile, instructions that actually matter in production, such as “maintain a professional tone” or “if the user asks about competitors, redirect politely,” are impossible to verify with regex.
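Both of those IFEval-style constraints really are one-liners to verify, which is the point: checkability, not usefulness, drove their selection. A sketch:

```python
def no_commas(text: str) -> bool:
    return "," not in text

def letter_e_count_is(text: str, n: int = 14) -> bool:
    return text.lower().count("e") == n

# A response can pass both checks while being useless -- they say
# nothing about coherence, tone, or the domain rules that matter.
gibberish = "zzz " + "e" * 14
print(no_commas(gibberish), letter_e_count_is(gibberish))  # True True
```

No comparably short function exists for "maintain a professional tone," which is why rubric-and-verifier setups like AdvancedIF were needed at all.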
AdvancedIF addressed this with rubric-based evaluation: human experts write prompts and grading criteria, then a fine-tuned verifier (0.728 F1 agreement with human judges) scores model outputs. Even under this framework, the best models (Gemini 3 Pro and GPT-5) reached only about 75%. Multi-turn instruction following with carried-over context, and steerability via system prompts, proved significantly harder than single-turn instruction following.
Implications and Mitigations
The research converges on a few practical conclusions:
- Instruction compliance is not a solved problem. Even frontier models fail 18-25% of the time on carefully constructed compliance tests. In production conditions with competing context, the failure rate is higher.
- Compliance degrades with context length. Goal drift is measurable and affects all models. Long-running agent sessions need periodic instruction reinforcement or checkpoint mechanisms.
- Multi-agent cascading is a real risk. Each handoff between agents is an opportunity for deviation to amplify. Circuit breakers and isolation boundaries are not optional.
- Rejection sampling works. The IFEval++ finding that generating multiple outputs and selecting the best one can compensate for individual-run unreliability suggests that redundancy, not single-shot generation, should be the default for compliance-critical tasks.
- Human oversight remains non-negotiable. The Qodo survey data is clear: experienced developers who maintain review processes see the highest quality gains. The developers who trust AI output most are, by the data, the ones who should trust it least.
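The rejection-sampling recipe from the list above fits in a few lines: sample n candidates, score each with a programmatic compliance check, return the first that passes. The `generate` function here is a stand-in for any model call, and the word-count constraint is an arbitrary example:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Stand-in for a model call; returns a reply of variable length."""
    rng = random.Random(seed)
    return " ".join("word" for _ in range(rng.randint(80, 140)))

def compliant(reply: str, max_words: int = 100) -> bool:
    return len(reply.split()) <= max_words

def best_of_n(prompt: str, n: int = 10):
    """Rejection sampling: return the first compliant candidate, or
    None if all n samples violate the constraint."""
    for seed in range(n):
        candidate = generate(prompt, seed)
        if compliant(candidate):
            return candidate
    return None

reply = best_of_n("summarize in at most 100 words")
assert reply is None or compliant(reply)
```

Production variants typically rank all n candidates by a score rather than taking the first pass, but the structural point is the same: per-sample unreliability is traded for extra inference cost.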
The compliance gap is not a reason to stop using AI agents. It is a reason to stop assuming they do what they are told. The models are improving, the benchmarks are getting more rigorous, and training methods like RIFL show that instruction following can be specifically optimized. But the trajectory of the research is clear: the harder you test, the more failures you find. And the gap between “follows instructions on a benchmark” and “follows instructions in the wild” is where most of the real risk lives.



