Artificial Intelligence Explainers 16 min read

How Computers Parse Language: The Mechanics of Natural Language Processing

From tokenization to attention layers, the five-stage pipeline that transforms raw text into language-model output reveals both the ingenuity of modern language models and their fundamental constraints.


When you type a question into ChatGPT or ask your phone to set a timer, text-processing software turns language into machine-readable representations. In modern language models, those representations are compared with patterns learned from large text corpora to generate output that can sound human. Understanding the natural language processing mechanics behind this transformation reveals both the ingenuity of modern AI and its fundamental limitations.

This process can be understood through five core stages, each solving a specific problem in representing and transforming language. Modern transformer systems are not built from hand-coded grammar rules alone; they learn statistical patterns from enormous quantities of text.

Step 1: Tokenization Breaks Text Into Pieces

Computers cannot process raw text directly. The first step in natural language processing mechanics is tokenization: splitting a sentence into discrete chunks called tokens.[s] A token might be a whole word, part of a word, or even a single character, depending on the system.

A common approach, Byte Pair Encoding (BPE), originated as a data compression technique before being adapted for language-model tokenization.[s] BPE starts with bytes or characters and iteratively merges the most frequent pairs until reaching a target vocabulary size.

Consider the sentence “unhappiness is common.” BPE might split “unhappiness” into “un,” “happiness,” or even “un,” “happ,” “iness,” depending on what patterns appeared most frequently in training data. This flexibility lets models handle words they have never seen before by breaking them into familiar pieces.

The vocabulary choice matters enormously. As one research group put it, the vocabulary “needs to allow the encoding of enough of the semantic information contained in a text to achieve the natural language processing tasks the model is designed to address.”[s] Too small a vocabulary forces awkward splits; too large wastes computational resources on rare tokens.

Whitespace presents a surprisingly thorny problem. English speakers assume spaces reliably separate words, but this assumption fails across languages and even within English. Multi-word expressions like “by the way” or “search engine” function as single semantic units.[s] Chinese and Japanese often do not use whitespace between words. Recent research on “superword” tokenizers that bridge whitespace boundaries achieved up to 33% fewer tokens on average while improving model performance by 4.0% across 30 benchmark tasks.[s]

Step 2: Embedding Transforms Tokens Into Vectors

Once text is tokenized, each token index gets mapped to a learned vector: a list of numbers used by the model to predict and relate tokens.[s] This embedding step gives the model its only view of the text: from here on it works with vectors, not characters. Tokens used in similar contexts often end up in nearby regions of the embedding space.

Modern large language models use embedding dimensions of thousands. Llama 3’s 70 billion parameter version uses 8,192-dimensional vectors; Gemma 3’s 27 billion parameter version uses 5,376 dimensions.[s] The representation as a whole captures learned features, though researchers often cannot assign a stable human meaning to any single component.

Position and context matter too. The same token can contribute different meanings in phrases like “river bank” and “bank account,” and later attention layers help disambiguate it. Early systems added positional information directly to token embeddings; modern architectures like Llama apply positional encoding within the attention mechanism itself through a technique called Rotary Positional Embeddings.[s]

Step 3: Attention Lets Tokens Communicate

The natural language processing mechanics of attention represent “a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions.”[s] Before attention-based transformers, many sequence models processed text sequentially, struggling to connect words far apart in a sentence.

Attention works like a database lookup. Each token generates three things: a query (what it is looking for), a key (what it offers), and a value (the information it carries). The system compares each query against all keys, weights the results, and combines the corresponding values.[s] When processing the word “it” in “The cat sat on the mat because it was tired,” attention lets “it” look back and determine that it refers to “cat” rather than “mat.”

Modern transformers use “multi-head attention,” running multiple attention operations in parallel. Each head can focus on different relationships: one might track grammatical agreement, another semantic similarity, a third positional patterns. The attention heads “read information from the residual stream of previous tokens via the query-key subspace projections, and then write the attended information back to the current position.”[s]

The “residual stream” deserves mention here. It acts as “the central highway for information propagation,” preserving a shared memory state that each layer updates.[s] Residual connections also help information and gradients move through deep networks.

Step 4: Feed-Forward Networks Store Knowledge

Between attention layers sit feed-forward networks (FFNs), and researchers increasingly view these as the model’s knowledge storage. FFNs “are often conceptualized as key-value memories, where the first layer projects the stream into a high-dimensional state (detecting patterns or ‘Knowledge Keys’) and the second layer writes the retrieved knowledge back to the stream.”[s]

When you ask a language model who wrote Hamlet, the attention mechanism helps represent relationships in the question, while feed-forward layers are often modeled as storing or retrieving facts such as Shakespeare wrote Hamlet. This division of labor, attention for relationships and FFNs for facts, helps explain why models can be surprisingly good at reasoning while making factual errors (and vice versa).

Step 5: Output Generation

After passing through the stack of attention and FFN layers, the final vector gets converted back into a probability distribution over the vocabulary. The model assigns a likelihood to every possible next token, from common words to rare symbols. During generation, it samples from this distribution (or picks the highest-probability option) and repeats the entire process for the next token.
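As a rough illustration of this last step, the sketch below turns a list of raw scores (logits) into probabilities with a softmax and then either picks the top token or samples one. The function names and the `temperature` parameter are assumptions made for the example, not something the article's sources specify.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next(logits, greedy=True, temperature=1.0, rng=random):
    """Return the index of the next token (hypothetical helper)."""
    probs = softmax(logits, temperature)
    if greedy:
        # Highest-probability option.
        return max(range(len(probs)), key=probs.__getitem__)
    # Otherwise sample in proportion to the probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In a real model the logits list has one entry per vocabulary token, so it can hold tens or hundreds of thousands of scores.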

This auto-regressive generation, predicting one token at a time, explains both the fluency and the failure modes of modern language models. Each prediction conditions on all previous tokens, maintaining coherence over long passages. But errors compound: an early mistake can derail an entire response.
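The one-token-at-a-time loop can be sketched in a few lines. Here `next_token_fn` is a hypothetical stand-in for the whole model: it receives every token so far and returns the next one. The `eos` stop token is likewise an assumption for illustration.

```python
def generate(next_token_fn, prompt, max_new_tokens, eos=None):
    """Autoregressive loop: each new token is appended and fed back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)  # conditions on all previous tokens
        tokens.append(nxt)
        if nxt == eos:
            break  # stop early once the end-of-sequence token appears
    return tokens

# Toy "model": the next token is the previous token plus one, modulo 5.
toy_model = lambda toks: (toks[-1] + 1) % 5
```

Because each output becomes part of the next input, a wrong token early in the sequence is carried into every later prediction, which is the compounding effect described above.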

Natural Language Processing Mechanics: The Limits

Understanding these natural language processing mechanics reveals constraints that researchers still grapple with as models scale.

Attention has quadratic complexity: processing a sequence of length n requires O(n²d) operations, where d is the embedding dimension.[s] Double your context length and you quadruple this part of the computation. This is one reason long-context modeling remains computationally expensive, and why researchers continue pursuing “efficient attention” variants.
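The quadrupling follows directly from the n² term. The short sketch below counts operations using only the O(n²d) estimate, ignoring constant factors; the sequence lengths and dimension used are arbitrary example values.

```python
def attention_cost(n, d):
    """Rough operation count for attention over n tokens of dimension d."""
    return n * n * d

short = attention_cost(4096, 128)
long = attention_cost(8192, 128)  # context length doubled
ratio = long // short             # cost grows with n squared
```

Doubling n from 4,096 to 8,192 multiplies the count by four, while doubling d would only double it.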

More philosophically troubling: these systems remain black boxes. “Our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as black boxes. They function exceptionally well, yet their internal mechanisms of operation, the how and why behind their efficacy, remain elusive.”[s]

Some researchers have explored whether language models develop something like theory of mind, the ability to model what others know and believe, a capacity that emerges in human children around age four. The evidence remains mixed; models can pass some theory of mind tests while failing others in ways that suggest reliance on shallow pattern matching rather than human-like reasoning.[s]

Perhaps most surprisingly, language models struggle with grammaticality in ways their fluent outputs obscure. “LMs do not (nor are they designed to) categorically assign lower probabilities to ungrammatical sentences than grammatical ones; LM probabilities are generally poor at distinguishing grammatical from ungrammatical sentences.”[s] They can produce grammatical text even though their raw probabilities are not a reliable overall grammaticality test.

Recent MIT research found that models do encode grammaticality information, but not in their output probabilities. “LMs acquire to some extent an implicit grammaticality distinction within their hidden layers,” accessible through specialized probes but not through the standard generation process.[s]

The Emergent Intelligence Problem

As models scale, some capabilities and failure modes become more prominent or take new forms, including in-context learning, scaling-law behavior, and sudden “aha moments” during training. Hallucination itself is not exclusive to large models (it occurs across neural natural-language generation systems), but it takes new shapes at scale.[s] These “emergent phenomena” resist prediction and explanation, appearing at scale thresholds researchers cannot reliably forecast.

Data composition shapes these capabilities in ways researchers are only beginning to understand. Models trained on mixtures of web text, books, code, and scientific articles outperform those trained on any single source.[s] Why diverse training data produces more capable models remains an open question, though it likely relates to the transfer of reasoning patterns across domains.

General computability results such as Rice’s theorem limit what can be proven automatically about arbitrary program behavior.[s] Open-ended neural systems inherit a version of that problem. Practical verification can prove bounded properties, but it cannot mathematically verify that a language model will never hallucinate, never produce harmful content, or never fail in safety-critical applications.[s]

What This Means

The natural language processing mechanics described here represent a specific bet: that statistical patterns over tokens can substitute for large amounts of explicit linguistic engineering. This bet has paid off remarkably well. Models trained to predict the next word have learned behaviors consistent with grammar, facts, reasoning patterns, and stylistic conventions, all without explicit instruction in each of them.

But the mechanisms remain alien to human language processing. We do not think in attention weights or feed-forward activations. Convergence of capability (AI systems that produce human-like language) does not imply convergence of mechanism. Understanding the natural language processing mechanics behind these systems helps us appreciate both their achievements and their inherent limitations.

Modern language models transform text into continuous representations through a pipeline of learned transformations: tokenization, embedding, attention, feed-forward networks, and output projection. Each stage implements specific inductive biases that enable statistical language modeling at scale. The natural language processing mechanics underlying these transformations determine both the capabilities and the failure modes of transformer-based systems.

Tokenization: Subword Decomposition

Natural language processing mechanics begin with tokenization, converting character or byte sequences into discrete tokens from a fixed vocabulary. Contemporary systems usually use subword tokenization, often Byte Pair Encoding (BPE) or variants like WordPiece. As the SuperBPE paper puts it, tokenizers “segment a stream of bytes into a sequence of tokens in the LM vocabulary.”[s]

BPE originated as a data compression algorithm and was adapted for NLP; the SuperBPE paper notes that “The algorithm originated in 1994 in the field of data compression.”[s] In standard BPE training, neighboring token-pair frequencies are recorded, the most frequent pair is merged into a new token, and the process repeats until the vocabulary reaches the target size.
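The training loop just described can be sketched as follows. This is a minimal character-level toy, not any production tokenizer: the function names are hypothetical, it skips byte-level handling and whitespace pretokenization, and ties between equally frequent pairs are broken by first-seen order.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        # Record neighboring-pair frequencies, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

Trained on a tiny corpus such as `["low", "low", "lower", "lowest"]`, the first merges build up the shared stem, so an unseen word like "slow" is still split into pieces the vocabulary already contains.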

Vocabulary selection involves tradeoffs. The vocabulary “needs to allow the encoding of enough of the semantic information contained in a text to achieve, given a model, the natural language processing tasks this model is designed to address.”[s] Larger vocabularies reduce sequence lengths but increase embedding parameters and long-tail sparsity.
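The parameter side of this tradeoff is simple arithmetic: an embedding table holds one vector per vocabulary entry. The sketch below uses the 8,192-dimensional width mentioned for Llama 3 70B; the vocabulary sizes are hypothetical round numbers chosen only to show the scaling.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a vocab_size x d_model embedding table."""
    return vocab_size * d_model

# Hypothetical vocabulary sizes, all at d_model = 8192.
table_sizes = {v: embedding_params(v, 8192) for v in (32_000, 128_000, 256_000)}
```

The count grows linearly with vocabulary size, so moving from 32,000 to 256,000 entries multiplies the table's parameters by eight.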

Standard BPE enforces whitespace boundaries, preventing tokens that span word boundaries. This assumption fails for languages without whitespace delimiters. “Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., by the way), cross-lingual variation in the number of words needed to express a concept, and languages that do not use whitespace at all.”[s]

Recent work on SuperBPE relaxes the subword constraint via a curriculum: first learning subwords with whitespace pretokenization, then learning superwords without it. This achieves “up to 33% fewer tokens on average than BPE” while improving downstream performance by 4.0% across 30 tasks and reducing inference compute by 27-33%.[s] The efficiency gains derive from capturing multi-word expressions as single tokens: “SuperBPE tokens often correspond to multi-word expressions in English, i.e., word sequences that function as a single semantic unit.”[s]

Embedding: Token-to-Vector Mapping

“Modern Natural Language Processing methods based on attention mechanisms process textual information not in the form of strings of characters, but as sequences of vectors.”[s] The embedding layer maps each token index to a learned vector. Contemporary models use high-dimensional embeddings: Llama 3 70B uses d=8192, Gemma 3 27B uses d=5376.[s]
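Mechanically, the embedding layer is a table lookup. In the sketch below, random numbers stand in for learned values, and the function names, the tiny sizes, and the 0.02 standard deviation are all assumptions made for illustration.

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Build a vocab_size x d_model table; random values stand in for learned ones."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Map each token index to its row in the table."""
    return [table[t] for t in token_ids]

table = make_embedding_table(vocab_size=10, d_model=4)
vectors = embed([3, 3, 7], table)  # the same token id always gets the same vector
```

At this stage the lookup ignores context entirely: both occurrences of token 3 receive an identical vector, and it is the later layers that make the representation context-dependent.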

Position encoding addresses self-attention’s permutation equivariance. “The permutation equivariance property of self-attention is proven, and its implications for positional encoding are thoroughly examined.”[s] Without positional information, the model cannot distinguish “dog bites man” from “man bites dog.” Early transformers added sinusoidal positional embeddings at the input; modern architectures like Llama use Rotary Positional Embeddings (RoPE), applying position-dependent rotations to query and key vectors within attention.
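The rotation idea behind RoPE can be sketched in plain Python. This is an illustrative version that rotates adjacent pairs of dimensions and assumes the commonly used frequency base of 10000; actual implementations differ in how they pair dimensions and in other details.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.3]
near = dot(rope(q, 5), rope(k, 3))    # positions 5 and 3, offset 2
far = dot(rope(q, 12), rope(k, 10))   # positions 12 and 10, same offset
```

The two scores agree up to floating-point error because rotating both vectors leaves their dot product dependent only on the difference between the positions, which is how the attention score comes to encode relative position.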

Understanding how embedding geometry relates to semantics remains active research. Whether nearby embeddings reflect similar meanings in human terms raises open questions about the relationship between statistical co-occurrence and genuine semantics.

Attention: Content-Based Information Routing

“Attention mechanisms represent a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions.”[s] Scaled dot-product attention computes:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where Q, K, V are query, key, and value projections of the input, and d_k is the key dimension.
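A direct, unoptimized transcription of this formula is sketched below, with Q, K, and V given as lists of row vectors. It covers a single head and omits the causal mask that decoder-only models apply; the function names are chosen for the example.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compare this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query that aligns strongly with one key returns almost exactly that key's value, while a query with equal scores for every key returns the plain average of the values.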

“Attention mechanisms, the building blocks of the Transformer architecture, allow encoding of semantic information between tokens through a database-like structure. Indeed, a database can be seen as a set of (key, value) tuples, such that when we submit a query q to the database, the database returns the value v associated to the tuple whose key is equal to the query.”[s]

Multi-head attention runs H parallel attention operations with separate learned projections, concatenating and projecting the results. This allows different heads to specialize in different relationship types. “Attention heads read information from the residual stream of previous tokens via the query-key subspace projections, and then write the attended information back to the current position via the value and output projections.”[s]

The residual stream architecture, where each layer adds its output to a running sum, enables gradient flow and compositional feature construction. “The residual stream acts as the central highway for information propagation. It preserves a shared memory state that is iteratively updated by the blocks.”[s]
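The running-sum structure can be shown with a short sketch. The `attn_fn` and `ffn_fn` callables are hypothetical stand-ins for real sublayers, only one position's vector is tracked, and layer normalization is left out for brevity.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def run_blocks(x, blocks):
    """x: residual-stream vector; blocks: list of (attn_fn, ffn_fn) pairs."""
    for attn_fn, ffn_fn in blocks:
        x = add(x, attn_fn(x))  # attention adds its output to the stream
        x = add(x, ffn_fn(x))   # the feed-forward network does the same
    return x

# Stand-in sublayers for illustration only.
zeros = lambda v: [0.0] * len(v)
ones = lambda v: [1.0] * len(v)
```

Because every sublayer only adds to the stream, a sublayer that outputs zeros leaves the input untouched, which is part of why information can pass unchanged through many layers.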

Computational complexity is O(n²d) for sequence length n and dimension d.[s] This quadratic scaling with sequence length remains a major bottleneck for long-context modeling and has motivated extensive research into sparse and linear attention variants.

Feed-Forward Networks: Pattern-Activated Knowledge Retrieval

Each transformer block interleaves attention with a position-wise feed-forward network. “Feed-Forward Networks are often conceptualized as key-value memories, where the first layer projects the stream into a high-dimensional state (detecting patterns or ‘Knowledge Keys’) and the second layer writes the retrieved knowledge back to the stream.”[s]

The standard FFN applies:

FFN(x) = W_2 * activation(W_1 * x)

with the intermediate dimension often larger than the model dimension in standard Transformer-style formulations. Modern variants like SwiGLU add gating mechanisms. The first projection can be interpreted as matching input patterns against learned “knowledge keys”; the second projection retrieves associated knowledge.
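The two-projection form above can be sketched with plain lists. ReLU is assumed as the activation, biases are omitted to match the formula, and the tiny hand-picked weights are purely illustrative rather than learned.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(x, W1, W2):
    """FFN(x) = W2 * activation(W1 * x), with ReLU as the activation."""
    hidden = relu(matvec(W1, x))  # expand to the larger intermediate dimension
    return matvec(W2, hidden)     # project back to the model dimension

# Model dimension 2, intermediate dimension 4.
W1 = [[1, 0], [0, 1], [-1, 0], [0, -1]]
W2 = [[1, 0, -1, 0], [0, 1, 0, -1]]
y = ffn([2.0, -3.0], W1, W2)
```

With these weights each hidden unit fires only for one sign of one input coordinate, a toy version of the "pattern detector" reading of the first projection; the second projection then recombines the active units into the output.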

This framing suggests attention handles relationship computation while FFNs store factual knowledge, though the boundary blurs in practice. Mechanistic interpretability research attempts to localize specific facts to specific neurons, with mixed success.

Natural Language Processing Mechanics: Fundamental Constraints

Despite empirical success, “our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as black boxes. They function exceptionally well, yet their internal mechanisms of operation, the how and why behind their efficacy, remain elusive.”[s]

Several emergent phenomena resist theoretical explanation: “LLMs exhibit numerous emergent phenomena that do not appear in smaller models, such as hallucination, in-context learning (ICL), scaling laws, and sudden ‘aha moments’ during training.”[s] These behaviors can be difficult to predict from smaller models.

Data composition matters. “Models trained on a carefully curated mixture of data from multiple sources (e.g., web text, books, code, scientific articles) consistently outperform those trained on monolithic corpora.”[s] The theoretical justification for mixture efficacy remains incomplete.

Grammaticality presents a surprising gap. “LMs do not (nor are they designed to) categorically assign lower probabilities to ungrammatical sentences than grammatical ones; LM probabilities are generally poor at distinguishing grammatical from ungrammatical sentences.”[s] Models can generate fluent text even though raw string probability is not a reliable grammaticality judgment.

Probing experiments reveal implicit grammatical knowledge. “A linear probe trained on synthetic perturbation data can outperform probability-based grammaticality judgments in both minimal-pair and non-minimal-pair-based grammaticality judgment benchmarks.”[s] The grammaticality distinction exists in hidden representations even when absent from output probabilities: “LMs acquire to some extent an implicit grammaticality distinction within their hidden layers.”[s]

Some researchers explore whether language models develop theory of mind, the capacity to model others’ beliefs and knowledge states. The evidence is mixed; models pass some false-belief tests while failing others, suggesting they may be exploiting surface correlates rather than possessing genuine mental state attribution.[s]

Formal verification can prove bounded properties of particular systems, but general computability limits such as Rice’s theorem prevent automatic proofs of arbitrary program behavior.[s] Open-ended language models inherit a version of that boundary: we cannot mathematically verify that a model will never hallucinate, never produce harmful output, or always reason correctly.[s]

The Path Forward

The natural language processing mechanics described here embody a specific hypothesis: that autoregressive prediction over tokens, scaled sufficiently, can yield useful language behavior. The hypothesis has proven remarkably productive, yielding systems that pass professional exams, write working code, and engage in extended dialogue.

Yet the gap between empirical capability and theoretical understanding persists. We can describe what attention computes but not why specific attention patterns emerge. We can measure that data mixtures improve performance but not derive optimal mixture proportions from first principles. The natural language processing mechanics of transformers are well-specified mathematically but remain opaque as cognitive or linguistic systems.


Sources