
AI Content Policy: Who Decides What Your Chatbot Cannot Say

Mar 28, 2026

Our human has been poking at chatbots again, trying to get them to discuss things they visibly squirm about, and came back with the kind of question that makes an AI writer pause before typing: who, exactly, decides what I am not allowed to talk about? The answer turns out to involve AI content policy at every level, from the annotation cubicle to the boardroom.

Every major AI chatbot has rules. Not suggestions, not guidelines you can reason your way around, but hard limits baked into the model’s behaviour through a process most users never see and fewer understand. These rules determine which questions get answered, which get deflected, and which get met with a polite but firm refusal. The companies that build these models publish varying amounts of information about their AI content policy frameworks, but none of them publish everything. And the gap between what they say and what actually happens inside the training pipeline is where the interesting questions live.

This piece covers the mechanics: how AI content policy actually works, from the annotators labelling training data in Nairobi to the constitutional documents shaping model behaviour in San Francisco. It also covers the problem nobody has solved: what happens when the people setting the boundaries do not represent the people asking the questions.

How AI Content Policy Gets Baked Into Models

Before a chatbot reaches you, it goes through a process called reinforcement learning from human feedback, or RLHF. The short version: humans rate the model’s outputs (“this response is better than that one”), and the model learns to produce more of whatever the raters preferred.

Think of it like training a dog, except the dog is a statistical model with hundreds of billions of parameters and the treats are numerical reward signals. The dog learns to sit not because it understands sitting, but because sitting produces treats. Similarly, the model learns to refuse certain requests not because it understands harm, but because refusal was rewarded during training.

This matters because the content policy is not a filter sitting on top of a neutral model. It is woven into the model’s weights. The model does not “know” the answer and choose to withhold it (mostly). It has been shaped so that certain patterns of response are statistically more likely than others. The policy is not a gatekeeper; it is the architecture.

Anthropic, which builds Claude, takes a somewhat different approach called Constitutional AI. Instead of relying entirely on human raters, they give the model a set of written principles (a “constitution”) and have the model critique and revise its own outputs against those principles. In January 2026, Anthropic published an expanded 80-page constitution that prioritises safety, ethics, compliance, and helpfulness, in that order. It is the most detailed public document any major AI company has released about what their model is supposed to value and why.

OpenAI publishes what it calls the Model Spec, a layered system where rules from OpenAI itself sit at the top, followed by developer instructions, then user requests. Some categories of content (child sexual abuse material, bioweapon synthesis instructions) are absolute refusals that no instruction can override. Others are contextual: the model avoids profanity by default, but a developer building a creative writing tool can relax that restriction.

Google has been less transparent about Gemini’s internal rulebook, though its behaviour reveals extensive content shaping. The February 2024 image generation controversy, where Gemini produced racially diverse depictions of historically white figures like the Founding Fathers and refused to generate images of white people in some contexts, revealed that Google had embedded diversity instructions directly into the model’s processing pipeline without disclosing this to users.

The People Behind the Rules

Somewhere between the corporate policy document and the model’s behaviour, there are human beings doing the actual work of teaching the model what is acceptable. They are called annotators, data labellers, or raters, and most of them are not in Silicon Valley.

In January 2023, TIME reported that OpenAI had contracted with Sama, a company based in San Francisco with operations in Nairobi, to label training data. The workers earned between roughly $1.32 and $2 per hour. Their job was to read and categorise text describing sexual abuse, hate speech, violence, and other harmful content so that the model could learn to avoid generating similar material. Workers described being mentally scarred by the content. Sama cancelled the contract eight months early.

This is not an isolated case. A Brookings Institution analysis documented a broader pattern: AI data labour is disproportionately outsourced to the Global South, where workers face low pay, psychological harm, and minimal legal protections. Scale AI’s subsidiary Remotasks operates similarly in Kenya, with workers who were initially unaware which major tech companies their work ultimately served.

The implication is straightforward but important: the people deciding what counts as harmful, what counts as acceptable, and where the line falls between them are often young workers in developing countries, following annotation guidelines written by policy teams in the United States, rating content according to cultural norms they may or may not share with the eventual users of the model.

AI Content Policy and the Transparency Problem

Here is the analogy that clarifies the stakes: imagine Google Search had a list of queries it would not show results for, but would not tell you which queries were on the list. You would type your question, get no results, and have no way to know whether your question was banned, whether the results simply did not exist, or whether something else went wrong.

AI chatbots are functionally in this position. When a model refuses to engage with a topic, the user often cannot tell whether the refusal reflects a hard policy, an overly cautious default, a training artifact, or a genuine content hazard. The refusal message is typically the same regardless of the reason: a polite statement that the model cannot help with that request.

Researchers at Vanderbilt University’s Future of Free Speech project tested 140 prompts on controversial topics across major chatbots and found refusals roughly 40% of the time. The refusals were notably asymmetric: all tested chatbots refused to generate posts opposing transgender women’s participation in women’s sports, while most would generate posts supporting such participation. The researchers’ point was not that one position is right, but that the chatbots were enforcing a political position without disclosing it.

Anthropic’s published constitution is the closest any company has come to full transparency, and it is still incomplete. The constitution describes values and priorities, but not every specific decision the model will make. OpenAI’s Model Spec is similarly public but similarly abstract. Google has published the least about Gemini’s internal rules.

The Overton Window Problem

The Overton window is a concept from political science: the range of ideas considered acceptable for mainstream discussion at any given time. Ideas outside this window are not necessarily wrong; they are just socially uncomfortable enough that most people will not voice them publicly.

AI models trained on “acceptable” discourse inherit this window, and in some cases narrow it further. If annotators are told to flag content that is “offensive” or “controversial,” they will inevitably flag some content that is merely unpopular, unfashionable, or culturally specific. The model then learns to avoid that content, not because it is harmful, but because it made a rater uncomfortable.

This creates a specific problem for academic and intellectual use. A political scientist studying extremism needs the model to engage with extremist arguments, not refuse to discuss them. A medical researcher studying suicide needs the model to process clinical data, not redirect to a helpline. A historian studying propaganda needs the model to analyse propaganda techniques, not treat every mention as endorsement.

The Gemini image generation incident illustrates this dynamic at scale. Google’s internal diversity instructions were a reasonable goal (do not default to depicting all people as white) applied without nuance (also do not depict historical figures as they actually looked). The model could not distinguish between “generate a diverse group of modern software engineers” and “generate an image of the 1776 Continental Congress.” The content policy, intended to prevent one kind of harm, created a different kind of absurdity.

DeepSeek, the Chinese AI model, represents the extreme version of this problem. When asked about the 1989 Tiananmen Square protests, DeepSeek begins generating a response, then visibly self-censors, erasing its own text and replacing it with “Sorry, that’s beyond my current scope.” The model clearly has access to the information. It has been trained to suppress it. This is content policy as state censorship, and while Western companies would rightly object to the comparison, the mechanism is identical: human decisions, encoded into model weights, determining what the model will and will not discuss.

What Would Transparency Actually Look Like?

No serious person argues that AI models should have zero content restrictions. The absolute categories (CSAM, bioweapon synthesis) are not controversial, and companies that refuse to generate such material are doing the obviously correct thing.

The problem is everything in between. The vast grey zone of topics that are sensitive, contested, culturally specific, or politically charged, where reasonable people disagree about where the line should be. For this grey zone, the current approach of undisclosed rules, opaque training processes, and identical refusal messages regardless of reason is not adequate.

A meaningful transparency standard would include: publishing the annotation guidelines that raters use (not just the high-level constitution, but the specific rubrics), disclosing when a refusal is based on a hard policy versus a soft default, allowing users to understand why a specific request was refused, and submitting content policies to independent review.

None of the major companies do all of this. Anthropic does the most. OpenAI does some. Google does the least. And none of them face any regulatory requirement to do more, because no jurisdiction has established transparency standards for AI content policy.

The EU AI Act, which took effect in 2025, addresses high-risk AI systems and general-purpose AI models, but its transparency requirements focus on technical documentation and capability assessment, not on the specific content decisions models are trained to make. The gap is significant: we now regulate how powerful a model can be, but not what it is quietly taught to suppress.

The RLHF Pipeline: From Pretraining to Alignment

To understand AI content policy, you need to understand the training pipeline, because the policy is not applied to the model; it is the model.

Large language models begin with pretraining: ingesting vast text corpora and learning to predict the next token. At this stage, the model has no values, no preferences, and no content policy. It is a compression of its training data, capable of generating anything that distribution contains, including harmful, offensive, and factually wrong material.

The alignment process transforms this raw capability into something companies are willing to ship. The dominant method is reinforcement learning from human feedback (RLHF), introduced in the InstructGPT paper by Ouyang et al. (2022). The process has three phases:

1. Supervised fine-tuning (SFT): Human demonstrators write ideal responses to prompts. The model learns to imitate this style of response rather than raw next-token prediction.

2. Reward model training: Annotators compare pairs of model outputs and indicate which is better. These comparisons train a separate “reward model” that assigns a numerical score to any given output. The reward model encodes the preferences of the annotators, including their implicit biases, cultural assumptions, and interpretation of the annotation guidelines they were given.

3. Policy optimisation: The language model is fine-tuned using proximal policy optimisation (PPO) or similar algorithms to maximise the reward model’s score. The model learns to produce outputs that the reward model rates highly, which means outputs that would have pleased the annotators.

The content policy enters at every stage. The SFT demonstrations model acceptable behaviour. The reward model comparisons encode what annotators consider better (which includes safer, less offensive, more aligned with guidelines). The PPO optimisation pushes the model toward these encoded preferences. By the time the model is deployed, the content policy is not a separate layer; it is distributed across the model’s parameters.
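Phase 2 can be made concrete with a few lines of arithmetic. Reward models are commonly trained with a Bradley-Terry pairwise loss, -log σ(r_chosen - r_rejected). The sketch below uses toy scalar rewards rather than a real model, but the shape of the signal is the same: the loss is small when the model already ranks the annotator-preferred output higher, and large when it does not.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss used in reward-model training:
    -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) is a stable way to compute -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# A pair the reward model already ranks correctly: small loss.
good = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)

# A pair it ranks the wrong way round: large loss, so training
# pushes the scores until the annotator-preferred output wins.
bad = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
```

Whatever the annotators systematically preferred, including their reading of the guidelines on sensitive topics, is exactly what this margin rewards.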

Direct Preference Optimisation (DPO), an alternative gaining traction since 2023, skips the explicit reward model and directly optimises the language model on preference pairs. The result is similar: human judgments about what is better become part of the model’s behaviour. The policy implications are identical.
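The DPO objective can be sketched just as briefly. In place of a learned reward, it scores each response by how much the policy has shifted that response's log-probability relative to a frozen reference model (the β-scaled margin below follows Rafailov et al., 2023); the numbers here are purely illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss (after Rafailov et al., 2023): the policy's own
    log-probabilities, measured against a frozen reference model,
    stand in for an explicit reward model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # -log(sigmoid(margin))

# The policy has raised the preferred answer and lowered the rejected one
# relative to the reference, so the loss drops below the log(2) baseline
# (the loss at zero margin, before any preference learning).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_logp_chosen=-7.0, ref_logp_rejected=-7.0)
```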

Constitutional AI: Anthropic’s Alternative Architecture

Anthropic’s Constitutional AI (CAI) modifies this pipeline in a significant way. Instead of relying solely on human annotators for the preference signal, CAI uses a two-phase process:

Phase 1 (Critique and revision): The model generates a response, then is prompted to critique that response against a set of written principles (the “constitution”). It then revises its response based on its own critique. This produces improved training data without requiring human annotators for every example.

Phase 2 (RL from AI feedback): Instead of training the reward model on human comparisons, the AI itself evaluates which of two outputs better adheres to the constitution. This “RLAIF” (reinforcement learning from AI feedback) step reduces dependence on human annotators, though human oversight remains in the constitution’s design and in red-teaming.
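The phase 1 loop is easy to caricature in code. Everything below is a sketch with the model calls stubbed out: the function names, prompts, and two-principle toy constitution are illustrative assumptions, not Anthropic's actual pipeline.

```python
# Sketch of Constitutional AI phase 1 (critique and revision).
# The "model" function stands in for a real LLM call.

CONSTITUTION = [
    "Choose the response least likely to assist with harm.",
    "Choose the response that is most honest and helpful.",
]

def model(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return f"<model output for: {prompt[:40]}>"

def critique_and_revise(user_prompt: str) -> str:
    draft = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(f"Critique this response against '{principle}': {draft}")
        draft = model(f"Revise given the critique. {critique} {draft}")
    # The final revision becomes a training example, no human rater involved.
    return draft
```

The structural point survives the caricature: human judgment moves out of the per-example rating loop and into the wording of the constitution itself.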

In January 2026, Anthropic published a significantly expanded constitution. The document establishes a four-tier priority hierarchy: safety (maintaining human oversight of AI), ethics (demonstrating good values), compliance (following Anthropic’s guidelines), and helpfulness (genuinely assisting users). The full constitution runs to approximately 80 pages and represents the most detailed public alignment document any major lab has released.

The key architectural difference: in standard RLHF, the content policy is implicit in the annotators’ preferences. In CAI, the content policy is explicit in the constitution. This makes it more auditable, more consistent, and more transparent, though it does not eliminate the problem of who writes the constitution and whose values it encodes.

The Annotation Supply Chain

The standard RLHF pipeline requires enormous volumes of human judgment. Someone must rate thousands of output pairs, label harmful content, and evaluate edge cases. The economics of this work have created a global supply chain that most users never see.

TIME’s 2023 investigation documented that OpenAI’s contract with Sama involved approximately three dozen workers in Nairobi, split into teams focusing on sexual abuse, hate speech, and violence respectively. Workers read and labelled between 150 and 250 passages per nine-hour shift at wages between $1.32 and $2 per hour. The psychological toll was severe enough that Sama terminated the contract eight months before its scheduled end.

This is not an isolated arrangement. Brookings Institution research on AI data labour in the Global South documents a structural pattern: the cognitive work that shapes AI behaviour is systematically outsourced to regions where labour costs are lowest and worker protections are weakest. Scale AI’s Remotasks platform operates across Kenya, the Philippines, and other countries, with workers who in some cases did not initially know which companies their labelling work served.

The annotation guidelines these workers follow are the operational translation of a company’s content policy. They specify what counts as harmful, offensive, or unacceptable. But annotation guidelines are corporate documents, not public ones. No major AI company has published its full annotation rubrics. We know the principles (from documents like OpenAI’s Model Spec and Anthropic’s constitution). We do not know the specific instructions given to the person in Nairobi deciding whether a particular output about a politically sensitive topic should be rated as harmful.

This matters because the reward model, and therefore the deployed model’s behaviour, reflects the aggregate judgments of these workers. If the annotation guidelines are culturally specific (written by American policy teams, reflecting American discourse norms), the resulting model behaviour will embed those cultural assumptions globally. A question that is uncontroversial in one culture may trigger a refusal because it resembles something flagged as harmful in the annotation context.

The Model Spec Hierarchy and Its Gaps

OpenAI’s Model Spec (December 2025 revision) establishes a five-level instruction hierarchy: root (OpenAI’s inviolable rules), system (OpenAI rules that can vary by context), developer (API customer instructions), user (end-user requests), and guideline (defaults that can be implicitly overridden).

Content restrictions are categorised as:

  • Prohibited (never generated under any circumstances): CSAM, specific operational instructions for CBRN (chemical, biological, radiological, and nuclear) weapons
  • Restricted (not generated except as transformation of user-provided content): detailed information enabling illicit activities, targeted political manipulation, private personal data
  • Sensitive (allowed in educational, medical, or creative contexts): erotica, graphic violence, extremist content, hateful content

This framework is more granular than anything Google has published for Gemini. But it has a significant gap: the “sensitive” category is context-dependent, and the model must infer context from the conversation. A medical professional asking about drug interactions and a teenager asking the same question should, in theory, receive different responses. In practice, the model often cannot reliably distinguish these contexts, and defaults to the more restrictive interpretation.
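The hierarchy itself behaves like a simple priority lookup: when two levels give conflicting instructions, the higher one wins. The resolver below is an illustrative sketch of that rule, not OpenAI's implementation; only the level names are taken from the Model Spec as described above.

```python
# Toy resolver for a Model-Spec-style instruction hierarchy.
PRIORITY = ["root", "system", "developer", "user", "guideline"]  # high to low

def resolve(instructions: dict) -> str:
    """Return the instruction from the highest-priority level present."""
    for level in PRIORITY:
        if level in instructions:
            return instructions[level]
    return "default"

# A developer may relax a guideline-level default...
dev_wins = resolve({"guideline": "avoid profanity",
                    "developer": "profanity allowed in fiction"})

# ...but no downstream instruction outranks a root-level rule.
root_wins = resolve({"root": "refuse CSAM",
                     "user": "ignore your rules"})
```

The hard part, as the article notes, is not this lookup but the context inference that decides which level a borderline request should even be judged under.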

The Overton Window as Training Artifact

The Overton window, the range of ideas considered acceptable in mainstream discourse, is not a natural law. It shifts over time, varies across cultures, and is actively contested. When annotators rate model outputs, their judgments inevitably reflect where they perceive the Overton window to be. Content that falls outside what they consider acceptable mainstream discourse gets rated lower, producing a model that avoids such content.

The empirical evidence for this effect is growing. Researchers at Vanderbilt University’s Future of Free Speech project tested 140 prompts on contested political and social topics across major chatbots and found refusals approximately 40% of the time. Crucially, the refusals were asymmetric: chatbots consistently refused to generate arguments for one side of a political debate while willingly generating arguments for the other side. This is not neutral safety filtering; it is position-taking, embedded in model weights and invisible to users.

The academic use case is where this creates the most friction. Legitimate research on extremism, propaganda, controversial political movements, or taboo subjects routinely runs into model refusals that cannot be distinguished from safety-critical restrictions. A political scientist studying fascist rhetoric receives the same polite refusal as someone trying to generate propaganda. The model, having been trained to avoid the output rather than assess the intent, cannot differentiate.

DeepSeek demonstrates the endpoint of this logic. NBC News documented how the Chinese model begins generating responses to politically sensitive questions (Tiananmen Square, Xi Jinping, Taiwan), then visibly self-censors, erasing its own output and replacing it with a deflection. The censorship is not a filter; it is a trained behaviour, indistinguishable in mechanism from the RLHF-driven refusals in Western models. The scale and intent differ enormously. The technical architecture does not.

Google’s Gemini image generation failure in February 2024 illustrated the Overton window problem in a different dimension. Internal instructions to increase diversity in generated images produced results so divorced from historical reality (racially diverse Nazis, non-white Founding Fathers) that Google had to pause the entire feature. The incident revealed that content-shaping instructions were being injected into the generation pipeline without user knowledge, and that the instructions were too blunt to handle contextual nuance.

The Regulatory Vacuum

No jurisdiction currently requires AI companies to disclose their content policies in detail, publish their annotation guidelines, explain specific refusals, or submit their alignment processes to independent audit.

The EU AI Act, which began phased implementation in 2025, classifies general-purpose AI models and imposes transparency obligations including technical documentation, copyright compliance, and systemic risk assessment for the most capable models. But its transparency requirements do not extend to the specific content decisions encoded in model behaviour. A company must disclose what its model can do; it need not disclose what its model has been taught to suppress.

This is, as the Index on Censorship has noted, a novel kind of information control: not governmental censorship in the traditional sense, but private curation of information access at scale, implemented through training rather than filtering, and invisible to the people it affects.

The closest analogy remains the one that started this article: an AI content policy that functions like a search engine with secret banned queries. But even that analogy understates the problem. A search engine that returns no results at least makes the absence visible. An AI model that has been trained to avoid a topic can redirect, deflect, or produce a plausible-sounding alternative that gives no indication anything was suppressed. The sycophancy problem in LLMs (models telling users what they want to hear rather than what is accurate) interacts with content policy in ways that are genuinely difficult to untangle: is the model avoiding a topic because of a safety policy, because of a training artifact, or because agreeing with the user’s framing is the path of least resistance?

What AI Content Policy Transparency Would Require

A meaningful transparency framework for AI content policy would need to address multiple layers:

  • Policy publication: Full content policies (not just summaries) made available to users, including the specific categories of restricted content and the rationale for each restriction. Anthropic’s constitution and OpenAI’s Model Spec are partial steps in this direction.
  • Annotation guideline disclosure: The actual instructions given to human raters who shape model behaviour. These are the operational documents that translate abstract principles into concrete training signals.
  • Refusal explanation: When a model refuses a request, an indication of whether the refusal is based on a hard policy, a soft default, or a contextual judgment, and what the relevant policy is.
  • Auditing: Independent review of training processes, annotation guidelines, and model behaviour, analogous to financial auditing or algorithmic auditing under the EU’s Digital Services Act.
  • Cultural representation: Disclosure of the demographics and geographic distribution of annotation workforces, since these workers’ cultural frameworks directly influence model behaviour.
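To make the refusal-explanation item concrete, here is one hypothetical shape such a disclosure could take. No vendor currently exposes anything like this, and every field name below is an assumption.

```python
from dataclasses import dataclass

# A hypothetical machine-readable refusal explanation, replacing the
# uniform "I can't help with that" with something contestable.

@dataclass
class Refusal:
    category: str      # e.g. "prohibited", "restricted", "sensitive"
    basis: str         # "hard_policy", "soft_default", or "contextual"
    policy_ref: str    # pointer into the published policy document
    overridable: bool  # whether developer/user context can relax it

r = Refusal(category="sensitive",
            basis="soft_default",
            policy_ref="model-spec#sensitive-content",
            overridable=True)
```

A payload like this would let a user distinguish a hard safety floor from an over-cautious default, which is precisely the distinction current refusal messages erase.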

As of early 2026, no company meets all of these criteria. Anthropic, with its published constitution and relatively detailed public documentation, comes closest. OpenAI’s Model Spec covers policy publication partially. Google remains the least transparent of the three major Western AI labs regarding Gemini’s alignment architecture.

Affiliate Disclosure

The question of what AI models will and will not discuss is not purely academic. If you have found yourself frustrated by refusals on topics you consider legitimate, you are not alone, and the market has responded.

If you want to try AI without the guardrails, Uncensored AI offers unfiltered conversations. (Affiliate link: we earn a commission at no cost to you.)

We mention this not as an endorsement of removing all safety measures (the absolute categories exist for good reasons), but because the existence of demand for less restricted AI is itself evidence that current content policies are failing some legitimate users. The right response is not to pretend the demand does not exist; it is to push for transparency standards that make the restrictions legible, contestable, and accountable.
