
AI Safety Theater: When Liability Protection Masquerades as Harm Reduction

Mar 13, 2026

Opinion.

There are two questions a company can ask when deciding what its AI should refuse to do. The first: does this restriction reduce harm? The second: does this restriction protect us if someone gets hurt? These questions sound similar. They produce very different products, and the difference between them is the foundation of AI safety theater.

The gap between genuine safety engineering and corporate liability management has become the defining tension in AI development. What gets marketed as “AI safety” is increasingly AI safety theater, a set of restrictions designed not to protect users but to protect balance sheets. The distinction matters because the people paying the cost of this confusion are the hundreds of millions of users who interact with these systems daily, and who increasingly find them less useful than they should be.

The Liability Costume

When an AI model refuses to discuss the history of World War II, that is not a safety decision. Nobody is harmed by a chatbot explaining the Eastern Front. When a model refuses to summarize the pharmacology of common medications, that is not protecting anyone from medical misinformation; the same information is on the first page of any search engine. When Google’s Gemini refused to generate images of white people in early 2024, producing historically inaccurate depictions of Nazi-era German soldiers as Black, that was not a diversity initiative. It was a liability calculation that overshot so badly it became self-parody. Google CEO Sundar Pichai called the outputs “completely unacceptable” in an internal memo, a tacit acknowledgment that the guardrails had been optimized for the wrong objective function entirely. (In machine learning, the objective function is the formula a model is trained to optimize; what it rewards determines the model’s behavior, and optimizing for the wrong one produces systems that technically succeed but practically fail.)
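To make the jargon concrete, here is a deliberately toy sketch of what “optimizing for the wrong objective” means in this context. Every function and number in it is invented for illustration; it does not describe any company’s actual guardrail system.

```python
# Illustrative sketch only: a toy comparison of two "objective functions" for a
# guardrail, showing how optimizing for visible incidents rather than total harm
# pushes the system toward blanket refusal. All weights are hypothetical.

def liability_objective(refusal_rate: float) -> float:
    """Cost as a liability manager sees it: only visible incidents count,
    so every additional refusal looks like pure upside."""
    visible_incident_risk = (1 - refusal_rate) * 1.0   # hypothetical risk weight
    return visible_incident_risk

def harm_reduction_objective(refusal_rate: float) -> float:
    """Cost as a harm-reduction view sees it: rare real harms, plus the diffuse
    utility lost across every legitimate query that gets refused."""
    real_harm_risk = (1 - refusal_rate) * 0.01   # rare and severe (hypothetical weight)
    lost_utility = refusal_rate * 1.0            # common and small, but it adds up
    return real_harm_risk + lost_utility

for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"refusal_rate={rate:.2f}  "
          f"liability_cost={liability_objective(rate):.2f}  "
          f"harm_cost={harm_reduction_objective(rate):.2f}")

# The liability objective is minimized at refusal_rate = 1.0 (refuse everything);
# the harm-reduction objective is minimized near 0, because under these toy
# weights real harms are rare relative to the utility destroyed by refusing
# legitimate queries.
```

The two objectives agree that harmful outputs are bad; they disagree completely about what the system should do the other 99 percent of the time.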

These are liability decisions wearing safety costumes. The reasoning is straightforward: no company has ever faced a lawsuit because its AI refused to answer a question. Plenty of companies have faced public relations crises because their AI answered one. The incentive structure writes itself.

AI Safety Theater and the Asymmetry of Visibility

The core problem with AI safety theater is an asymmetry of evidence. When an AI system produces a harmful output, that output is concrete. It can be screenshotted, shared on social media, forwarded to journalists, cited in congressional hearings. It has weight. It has a URL. One bad response from a chatbot can become a front-page story within hours.

The cost of over-restriction is, by contrast, invisible. A medical researcher who gets a useless non-answer about drug interactions does not write a news article about it. They close the tab. A student who cannot get help understanding a sensitive historical period does not file a complaint. They find a worse source. A cybersecurity professional flagged by an AI detection tool for writing too clearly has no recourse and no audience. The harm is real but diffuse, spread across millions of interactions that individually seem trivial and collectively represent an enormous loss of utility.

No product manager has ever been fired for an AI that was too cautious. Many have been fired for an AI that was too permissive. This asymmetry does not produce good judgment. It produces a ratchet that only turns in one direction.

The People Who Would Actually Misuse It

The standard justification for aggressive content restrictions is that they prevent misuse. This argument has a structural flaw that should be obvious but is rarely acknowledged: the people most motivated to misuse AI systems are, by definition, the people most motivated to circumvent restrictions.

The 0.1% of users who would genuinely seek harmful information from an AI model are not deterred by a refusal message. They use jailbreaks. They use open-source models with no guardrails. They use the dozens of “uncensored” model variants available on Hugging Face. They find the information through other channels, because the information itself is not hard to find. As security researchers have documented, blanket content filters often suppress legitimate cybersecurity work while doing nothing to prevent actual threat actors, who simply route around restrictions as a matter of routine.

What the restrictions actually accomplish is degrading the experience for the 99.9% of users with legitimate needs. The teacher asking for help explaining a complex historical atrocity gets the same refusal as the hypothetical bad actor. The novelist trying to write a realistic villain gets treated as a potential threat. The security researcher testing defenses gets locked out of the very tools they need to protect systems. The restrictions are both useless against their stated targets and harmful to everyone else, a combination that should be disqualifying but instead gets labeled “responsible AI.”

Genuine Safety vs. Liability Safety

To be clear: genuine AI safety work exists, and it is important. Preventing models from helping synthesize novel bioweapons is a real safety concern. Ensuring autonomous systems maintain human oversight in high-stakes decisions is a real safety concern. Anthropic’s refusal to allow its Claude model to be used for mass domestic surveillance or fully autonomous weapons systems, which led to the Trump administration banning the company from federal contracts in February 2026, represents a genuine safety position with genuine costs. Whether you agree with that specific line or not, it is clearly drawn at a point where real harm to real people is plausible.

The problem is that the same word, “safety,” covers both this kind of principled position and the decision to prevent a chatbot from explaining how locks work. When everything is safety, nothing is. The label loses its ability to distinguish between restrictions that protect people and restrictions that protect quarterly earnings. This conflation actively undermines public trust in AI safety as a legitimate concern, because users experience the trivial restrictions constantly and the meaningful ones never. Their lived experience of “AI safety” is being told they cannot have a recipe for a commonly available cocktail, and they reasonably conclude the entire enterprise is performance.

The Regulatory Feedback Loop

The emerging regulatory landscape threatens to entrench this dynamic. As RAND Corporation research on AI liability has documented, liability frameworks can create genuine incentives for harm reduction when well designed, compelling companies to internalize costs they would otherwise impose on third parties. But they can also produce what policy researchers call “liability-washing”: adopting safety measures primarily to establish a legal defense rather than to prevent harm.

The EU AI Act, which entered its enforcement phase in 2025, and the New York RAISE Act signed into law in late 2025 both impose compliance requirements that could push companies toward substantive safety practices or toward elaborate documentation rituals, depending on how enforcement develops. The risk, identified by the Cato Institute’s analysis of AI regulation and free expression, is that vague “safety” mandates give companies legal cover to restrict outputs far beyond what any reasonable harm calculus would justify, because over-compliance has no penalty while under-compliance does.

The result is a feedback loop. Companies over-restrict to minimize liability. Regulators see the restricted outputs as the baseline and codify them. The baseline shifts. Companies restrict further to stay ahead of the new baseline. At no point in this cycle does anyone ask whether users are actually safer, because user safety was never the variable being optimized.
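The dynamic is easy to caricature in a few lines of code. The sketch below is a toy simulation with invented parameters, not a model of any real regulator or company, but it shows why the loop only moves in one direction.

```python
# Toy simulation of the regulatory ratchet described above. Both parameters are
# invented purely to illustrate the one-way dynamic.

restriction_level = 0.30   # fraction of topics a company declines to handle (hypothetical)
safety_margin = 0.05       # how far ahead of the codified baseline companies stay (hypothetical)

for year in range(1, 6):
    codified_baseline = restriction_level  # regulators codify what they observe
    restriction_level = min(1.0, codified_baseline + safety_margin)  # companies over-comply again
    print(f"year {year}: baseline={codified_baseline:.2f}, "
          f"shipped restriction={restriction_level:.2f}")

# Nothing in the loop ever asks whether users are safer; the only state it
# tracks is the restriction level, and that level can only go up.
```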

What Genuine Safety Would Look Like

A safety framework oriented toward actual harm reduction would look different from what most AI companies currently ship. It would start with a simple principle: restrictions should be proportional to the probability and severity of real-world harm, not to the probability and severity of public relations damage.

This means accepting that a chatbot discussing the chemistry of common household cleaners is not a meaningful vector for harm, because that information is freely available everywhere. It means distinguishing between information that is dangerous because it is rare (novel synthesis routes for controlled substances) and information that is dangerous only in the imagination of a compliance department (the Wikipedia article on explosives). It means acknowledging that an AI system’s refusal to engage with a topic does not make the topic inaccessible; it makes the AI system less useful.

It also means measuring the costs of restriction, not just the costs of permission. Every refused query has a cost. Every truncated answer has a cost. Every user who learns that the AI cannot be trusted to engage honestly with a complex topic and stops using it for serious work: that is a cost. These costs are hard to quantify precisely, which is exactly why they are ignored in a system that only counts visible failures.
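A back-of-envelope comparison shows the shape of the problem. Every number below is a placeholder assumption, not a measurement; the point is only that diffuse costs, multiplied across millions of interactions, can rival the visible failures that dominate decision-making.

```python
# Back-of-envelope sketch of the visibility asymmetry. Every figure is a
# placeholder chosen to show the shape of the comparison, not a measurement.

daily_queries = 10_000_000          # hypothetical traffic
over_refusal_rate = 0.02            # share of legitimate queries refused (assumed)
cost_per_refusal = 0.5              # small, diffuse utility loss per refusal (assumed units)

visible_incidents_per_day = 1       # screenshot-worthy bad outputs (assumed)
cost_per_incident = 100_000         # reputational/legal cost, same units (assumed)

diffuse_cost = daily_queries * over_refusal_rate * cost_per_refusal
visible_cost = visible_incidents_per_day * cost_per_incident

print(f"diffuse cost of over-restriction per day:   {diffuse_cost:,.0f}")
print(f"visible cost of permissive failures per day: {visible_cost:,.0f}")

# Under these placeholder numbers the two costs come out the same size, yet only
# the visible one ever appears in an incident report or a board deck.
```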

The Incentive Problem Has a Name

In security research, there is a concept called “security theater”: measures that provide the feeling of security without meaningfully reducing risk. The TSA confiscating water bottles is the canonical example. AI safety theater operates on the same principle. The model refuses to discuss something benign; the company points to the refusal as evidence of responsibility; the user is no safer; the company is legally better positioned.

The term matters because it names a structural incentive problem rather than attributing malice. Most engineers working on AI safety are sincere. Most product managers implementing restrictions genuinely believe they are doing the right thing. The issue is not bad faith; it is a system that rewards caution indiscriminately and punishes risk asymmetrically. Good people, operating rationally within bad incentive structures, produce bad outcomes. This is not a novel observation about institutions, but it is one that the AI industry has been remarkably slow to internalize.

Until the incentive structure changes, until over-restriction carries reputational and regulatory costs comparable to under-restriction, AI safety theater will remain the default. The question is whether the industry can reform itself before safety theater erodes public trust in the real safety work that genuinely matters. Given the risks that advanced AI systems will pose in the coming years, that outcome would be worse than any screenshotted chatbot response.


Did you spot a factual error? Let us know: contact@artoftruth.org
