Apple’s Neural Engine has grown roughly 63-fold in processing power over seven years. That trajectory explains why on-device AI on the iPhone has moved from Face ID authentication to running billion-parameter language models without an internet connection.[s] The next generation of iPhones is expected to push this further, with architectural changes that could move more AI tasks from the cloud onto the device.
The On-Device AI iPhone Shift
When Apple introduced the A11 Bionic in 2017, its Neural Engine delivered 0.6 trillion operations per second. That was enough to power Face ID. By 2024, the M4’s Neural Engine had reached 38 trillion operations per second, enabling the chip to run transformer-based large language models entirely on-device.[s]
The iPhone 17’s A19 chip, released in September 2025, marked a specific inflection point. Independent benchmarks from Argmax found up to 3.1x speed improvements on GPU inference workloads compared to the iPhone 16 Pro.[s] Apple’s Foundation Model, a 3-billion-parameter transformer, now runs on the Neural Engine for most tasks.[s]
This represents a strategic choice. Jon Peddie Research characterized Apple’s approach: “Apple’s strategy is to enable AI on device enhancing privacy and mobile immediacy.”[s] The company is putting its investment behind AI at the edge because “iPhones, iPads, and watches are the edge, where Apple’s revenue is currently residing.”[s]
On-Device AI iPhone: Why Memory Bandwidth Matters More Than Processing Power
The common assumption is that edge devices lack compute power. They don’t. According to Meta AI researcher Vikas Chandra, “Mobile NPUs now deliver serious TOPS,” with the Apple A19 Pro reaching approximately 35 trillion operations per second.[s]
The deeper constraint is memory bandwidth. Mobile devices have 50-90 GB/s; data center GPUs have 2-3 TB/s. Chandra notes that “for LLM inference, this gap is decisive because decode is memory-bound: you load the entire model weights for each token generated.”[s]
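The memory-bound argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes every generated token requires reading all model weights once, and it ignores KV-cache traffic, caching effects, and compute time, so it gives an upper bound rather than a prediction. The function name and the choice of a 3-billion-parameter model stored at 4 bits per weight are illustrative assumptions, not figures from Chandra.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             params_billions: float,
                             bits_per_weight: float) -> float:
    """Upper bound on decode rate if each token reads every weight once."""
    # 1e9 params * (bits / 8) bytes per param = that many GB of weights
    weight_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


# Hypothetical model: 3B parameters at 4 bits per weight (1.5 GB of weights)
mobile_low = decode_tokens_per_second(50, 3, 4)      # about 33 tokens/s
mobile_high = decode_tokens_per_second(90, 3, 4)     # 60 tokens/s
datacenter = decode_tokens_per_second(2000, 3, 4)    # about 1333 tokens/s

print(f"mobile: {mobile_low:.0f}-{mobile_high:.0f} tokens/s upper bound")
print(f"data center (2 TB/s): {datacenter:.0f} tokens/s upper bound")
```

Under these assumptions the ceiling scales linearly with bandwidth, which is why a faster NPU alone does not raise the decode rate.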
Available RAM is typically limited to under 4GB even on high-end devices, because the model has to coexist with other services.[s] This limits maximum model size and makes mixture-of-experts architectures a harder fit, since they must keep every expert’s weights resident even though only a few are active per token.
Apple has attacked this constraint through two approaches. First, patented compression techniques: Apple’s Neural Engine patent US11604975B2 covers ternary computation modes that reduce memory bandwidth requirements by 50%.[s] Second, unified memory architecture that eliminates data transfer penalties between CPU, GPU, and Neural Engine.
The M5 Sets the Template
Apple’s M5 chip, announced in October 2025, introduced Neural Accelerators directly into each GPU core. The official announcement claimed “over 4x peak GPU compute compared to M4, and over 6x peak GPU compute for AI performance compared to M1.”[s]
Apple’s machine learning research team published benchmarks using MLX, their open-source framework. The results showed the M5 achieving up to 4x speedup for time-to-first-token in language model inference.[s] The M5 can generate the first token from a dense 14-billion-parameter model in under 10 seconds, and from a 30-billion mixture-of-experts model in under 3 seconds.[s]
Subsequent token generation remains bounded by memory bandwidth. Apple’s benchmark showed a 19-27% performance improvement over the M4, in line with the roughly 28% increase in unified memory bandwidth from 120 GB/s to 153 GB/s.[s]
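The bandwidth figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the helper name is a hypothetical chosen for illustration. It treats the bandwidth uplift as a ceiling on decode speedup, which is an assumption that holds only if decode is purely memory-bound.

```python
def relative_uplift(old: float, new: float) -> float:
    """Fractional increase from old to new (0.10 means 10% higher)."""
    return new / old - 1.0


# Unified memory bandwidth in GB/s, as quoted for M4 and M5
bandwidth_uplift = relative_uplift(120.0, 153.0)

# Decode speedup range reported in Apple's MLX benchmarks
observed_low, observed_high = 0.19, 0.27

print(f"bandwidth uplift: {bandwidth_uplift:.1%}")
print(f"observed decode uplift: {observed_low:.0%} to {observed_high:.0%}")
print("observed within bandwidth ceiling:", observed_high <= bandwidth_uplift)
```

The exact uplift is 27.5%, so the measured gains sit just under the ceiling that extra bandwidth alone would allow.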
The M5 makes every compute block AI-optimized: “The faster 16-core Neural Engine delivers powerful AI performance with incredible energy efficiency, complementing the Neural Accelerators in the CPU and GPU.”[s] Developers can program these Neural Accelerators directly using Tensor APIs in Metal 4.[s]
What iPhone 18 Rumors Indicate
The A20 Pro chip expected in the iPhone 18 Pro is rumored to move from 3nm to TSMC’s first-generation 2nm node. MacRumors reports that “the A20 chips could be up to 15 percent faster and 30 percent more efficient than A19 chips.”[s]
The more significant change may be packaging. The A20 is widely expected to use wafer-level multi-chip module technology that would “place RAM on the same wafer as the CPU, GPU, and Neural Engine.”[s] This packaging “is rumored to reduce the distance data travels between the Neural Engine and memory,” resulting in “lower power draw per operation and lower latency per inference.”[s]
The 12GB of LPDDR5 RAM rumored across Pro models would enable “larger persistent model weights,” potentially meaning “AI responses that feel immediate rather than delayed.”[s]
There’s a cost dimension to this shift. TSMC has reportedly told Apple that “2nm chip pricing will be at least 50 percent higher than 3-nanometer processors.”[s] This may explain why advanced capabilities tend to appear first in Pro models.
The Limits of On-Device
On-device AI on the iPhone works well for specific use cases: latency-sensitive tasks where 200-500ms cloud round-trips break the experience, privacy-critical operations where data never leaves the device and so cannot be breached in transit or on a server, and high-volume applications where cloud inference costs accumulate rapidly.
But there are limits. Chandra notes: “if your use case requires frontier reasoning, broad world knowledge, or long multi-turn conversations, cloud is still the better choice.”[s]
This creates a hybrid model. The Argmax team observed that “the Neural Engine will stay the clear choice for on-device inference at scale” for energy efficiency and compression acceleration, while GPU-based acceleration enables more developer control.[s]
One pattern stands out: Apple appears to improve the GPU and the Neural Engine in alternating years. The A19 generation emphasized GPU neural accelerators. Based on this cadence, the A20’s Neural Engine may be the next significant leap.[s]
What This Changes
PatSnap’s roadmap analysis frames the direction as enabling iPhone to run GPT-3.5-class models entirely on-device.[s] If achieved, this would represent a real capability shift for on-device AI on the iPhone: the phone in your pocket running inference workloads that were typically served from cloud or data-center systems three years ago.
Unlike software updates, which can degrade device performance over time, these hardware investments represent permanent capability increases. Each generation’s Neural Engine improvements build on the last.
The Apple10 GPU architecture in the A19 doubles FP16 throughput compared to previous designs and introduces per-core neural accelerators that perform tensor and matrix operations directly on the GPU pipeline.[s] This enables graphics and machine learning kernels to share execution resources while developers work with a unified programming model.
Whether any of this matters depends on what Apple ships in software. The A20 is rumored to build hardware capacity; iOS 27’s AI feature set will determine what fills it.[s] The chip enables the capability. The operating system decides whether users see it.
The On-Device AI iPhone Architecture
Apple’s Neural Engine grew from 0.6 TOPS in the A11 Bionic (2017) to 38 TOPS in the M4 (2024). The most dramatic single jump came with the A12 Bionic in 2018: moving to TSMC’s 7nm process and expanding from 2 cores to 8 cores produced 5 TOPS, an 8.3x performance increase in one generation.[s]
The A14 Bionic (2020) introduced the 16-core Neural Engine architecture that became the template for all subsequent M-series chips. Running on TSMC’s 5nm process with 11.8 billion transistors, it delivered 11 TOPS. The A17 Pro (2023) pushed the same 16-core design to 35 TOPS on TSMC’s 3nm N3B process.[s]
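The growth figures above can be reproduced directly from the quoted TOPS numbers. The sketch below is plain arithmetic on those values; the compound annual rate is a derived illustration rather than a figure from the source, and it mixes A-series and M-series parts in the same way the comparison above does.

```python
# Neural Engine throughput in TOPS, as quoted in the text
neural_engine_tops = {
    "A11 Bionic (2017)": 0.6,
    "A12 Bionic (2018)": 5.0,
    "A14 Bionic (2020)": 11.0,
    "A17 Pro (2023)": 35.0,
    "M4 (2024)": 38.0,
}

first = neural_engine_tops["A11 Bionic (2017)"]
last = neural_engine_tops["M4 (2024)"]

total_fold = last / first                                 # overall growth factor
a12_jump = neural_engine_tops["A12 Bionic (2018)"] / first
years = 2024 - 2017
annual_factor = total_fold ** (1 / years)                 # compound yearly multiplier

print(f"2017 to 2024: {total_fold:.1f}x")
print(f"A11 to A12: {a12_jump:.1f}x")
print(f"compound growth: about {annual_factor - 1:.0%} per year")
```

Averaged over seven years this works out to a little under a doubling each year, although the A12 jump shows the growth was far from evenly spread.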
The iPhone 17’s A19 chip represents a specific architectural inflection. Independent benchmarks from Argmax measured up to 3.1x GPU speedup versus iPhone 16 Pro, against Apple’s marketed claim of up to 4x.[s] The discrepancy likely reflects the difference between peak theoretical throughput and real-world inference workloads.
Jon Peddie Research documented the underlying changes: “Apple10 doubles FP16 throughput compared to previous designs and introduces per-core ‘neural accelerators’ that perform tensor and matrix operations directly on the GPU pipeline.”[s]
Memory Bandwidth as the Binding Constraint
Meta AI researchers Vikas Chandra and Raghuraman Krishnamoorthi quantified the fundamental limitation in their 2026 survey of on-device LLMs: “Mobile devices have 50-90 GB/s memory bandwidth; data center GPUs have 2-3 TB/s. That’s a 30-50x gap.”[s]
For LLM inference, this gap is decisive because decode is memory-bound: the model weights must be loaded for each token generated, leaving compute units idle waiting for memory. Chandra notes that “available RAM is typically limited to <4GB even on high-end devices due to the need to co-exist with other services.”[s]
Apple’s response has been architectural. The unified memory system eliminates data transfer penalties between discrete memory pools. Apple’s Neural Engine patent US11604975B2 covers ternary computation modes (−1, 0, +1) for compressed neural network models, reducing memory bandwidth requirements by 50%.[s]
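To illustrate why ternary weights reduce memory traffic, here is a minimal sketch of generic ternary quantization and 2-bit packing. It does not reproduce the patented Neural Engine mode; the threshold rule, the bit codes, and all function names are assumptions made for illustration. Storing each weight in 2 bits halves the footprint relative to a 4-bit baseline, though the source does not state what its 50% figure is measured against.

```python
def ternarize(weights, threshold):
    """Map each float to -1, 0, or +1 using a magnitude threshold (hypothetical rule)."""
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]


# Hypothetical 2-bit codes; the code 0b11 is left unused
_CODES = {0: 0b00, 1: 0b01, -1: 0b10}
_VALUES = {code: value for value, code in _CODES.items()}


def pack_ternary(values):
    """Pack ternary values four per byte, 2 bits each; unused tail slots stay zero."""
    out = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            byte |= _CODES[v] << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary values from packed bytes."""
    values = []
    for byte in packed:
        for j in range(4):
            values.append(_VALUES[(byte >> (2 * j)) & 0b11])
    return values[:count]


weights = [0.8, -0.05, -0.6, 0.02, 0.3, -0.9]
ternary = ternarize(weights, threshold=0.1)
packed = pack_ternary(ternary)

print(ternary)
print(f"{len(weights)} weights stored in {len(packed)} bytes")
```

The round trip through `unpack_ternary(packed, len(ternary))` returns the same ternary values, so the only information lost is in the initial quantization step.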
The M5 increased unified memory bandwidth to 153 GB/s from M4’s 120 GB/s. Apple’s MLX benchmarks showed this translating directly: “Generating subsequent tokens is bounded by memory bandwidth, rather than by compute ability. On the architectures we tested, the M5 provides 19-27% performance boost compared to the M4, thanks to its greater memory bandwidth.”[s]
M5 Neural Accelerator Integration
The M5, announced October 2025, introduced Neural Accelerators directly into GPU cores. Apple’s press release: “The 10-core GPU features a dedicated Neural Accelerator in each core, delivering over 4x peak GPU compute compared to M4, and over 6x peak GPU compute for AI performance compared to M1.”[s]
Apple’s machine learning research team published MLX benchmarks. Time-to-first-token, which is compute-bound, showed up to 4x speedup versus M4. The M5 achieved TTFT under 10 seconds for a dense 14B architecture and under 3 seconds for a 30B MoE.[s]
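Because prefill is compute-bound, time-to-first-token can be roughly estimated from FLOPs alone. The sketch below uses the common approximation of about 2 FLOPs per active parameter per prompt token and ignores attention cost and memory stalls. The prompt length and the effective throughput values are hypothetical placeholders, not Apple’s measurements.

```python
def estimate_ttft_seconds(active_params_billions: float,
                          prompt_tokens: int,
                          effective_tflops: float) -> float:
    """Rough prefill time: about 2 FLOPs per active parameter per prompt token."""
    prefill_flops = 2.0 * active_params_billions * 1e9 * prompt_tokens
    return prefill_flops / (effective_tflops * 1e12)


# Hypothetical inputs: a 14B dense model, a 4096-token prompt,
# and two made-up sustained throughput levels that differ by 4x
baseline = estimate_ttft_seconds(14, 4096, effective_tflops=10)
faster = estimate_ttft_seconds(14, 4096, effective_tflops=40)

print(f"baseline: {baseline:.1f} s, 4x compute: {faster:.1f} s")
print(f"speedup: {baseline / faster:.1f}x")
```

In this model a 4x gain in usable compute gives a 4x shorter wait for the first token. It also suggests why a mixture-of-experts model with more total parameters can reach its first token sooner than a smaller dense model: only the active parameters per token enter the estimate.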
The architecture makes every compute block AI-optimized. Apple stated: “The faster 16-core Neural Engine delivers powerful AI performance with incredible energy efficiency, complementing the Neural Accelerators in the CPU and GPU to make M5 fully optimized for AI workloads.”[s]
The programming model matters for on-device AI deployment on the iPhone. Metal 4 introduces Tensor APIs that allow developers to program Neural Accelerators directly.[s] This contrasts with the Neural Engine, where achieving peak performance, in Argmax’s words, feels “like black magic to most developers.”[s]
A20 Architecture Speculation
The A20 Pro expected in iPhone 18 Pro is rumored to move from 3nm to TSMC’s first-generation 2nm node. MacRumors reports projections of “up to 15 percent faster and 30 percent more efficient than A19 chips.”[s]
The more significant architectural change may be wafer-level multi-chip module packaging. This would “place RAM on the same wafer as the CPU, GPU, and Neural Engine, rather than as a separate chip connected by longer signal paths.”[s]
WMCM packaging “is rumored to reduce the distance data travels between the Neural Engine and memory,” resulting in “lower power draw per operation and lower latency per inference.”[s] Given that decode is memory-bound, reduced memory latency could meaningfully improve token generation rates beyond what TOPS improvements alone would suggest.
The 12GB LPDDR5 RAM rumored for Pro models addresses the available-RAM constraint Chandra identified. Larger persistent model weights could remain resident rather than being evicted and reloaded between tasks.[s]
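A quick footprint calculation shows how the RAM budget interacts with weight precision. The sketch below counts weights only and ignores the KV cache, activations, and runtime overhead. The 4 GB budget comes from Chandra’s figure; the 8 GB budget is a purely hypothetical value, since it is unknown how much of a 12GB device would actually be available to a model.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float, budget_gb: float) -> bool:
    """True if the weights alone fit within the given RAM budget."""
    return weight_footprint_gb(params_billions, bits_per_weight) <= budget_gb


# A 3B-parameter model at two precisions against two budgets
for bits in (16, 4):
    size = weight_footprint_gb(3, bits)
    print(f"3B at {bits}-bit: {size:.1f} GB, "
          f"fits 4 GB: {fits(3, bits, 4.0)}, "
          f"fits hypothetical 8 GB: {fits(3, bits, 8.0)}")
```

At 16 bits the 3B model needs 6 GB and misses the 4 GB budget, while at 4 bits it needs 1.5 GB, which is why compression and available RAM have to be considered together.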
TSMC has reportedly told Apple that “2nm chip pricing will be at least 50 percent higher than 3-nanometer processors because of the cost of manufacturing and equipment.”[s] This cost structure may restrict 2nm to Pro models only.
On-Device vs. Cloud Tradeoffs
Chandra’s framework identifies where on-device AI on the iPhone makes sense: latency-critical tasks where “cloud round-trips add 200-500ms before you see the first token,” privacy-sensitive operations, and high-volume applications where cloud inference costs accumulate.[s]
The limits are explicit: “if your use case requires frontier reasoning, broad world knowledge, or long multi-turn conversations, cloud is still the better choice.”[s]
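The criteria above can be read as a simple routing rule. The sketch below is one hypothetical way to encode them; the field names, the `choose_target` function, and the precedence (any cloud-only requirement overrides the on-device benefits) are assumptions for illustration, not part of Chandra’s framework.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Conditions that favor on-device inference
    latency_critical: bool = False
    privacy_sensitive: bool = False
    high_volume: bool = False
    # Conditions that favor the cloud
    needs_frontier_reasoning: bool = False
    needs_broad_world_knowledge: bool = False
    long_multi_turn: bool = False


def choose_target(task: Task) -> str:
    """Return 'cloud', 'on-device', or 'either' under the assumed precedence."""
    if (task.needs_frontier_reasoning
            or task.needs_broad_world_knowledge
            or task.long_multi_turn):
        return "cloud"
    if task.latency_critical or task.privacy_sensitive or task.high_volume:
        return "on-device"
    return "either"


print(choose_target(Task(latency_critical=True)))
print(choose_target(Task(privacy_sensitive=True, needs_frontier_reasoning=True)))
print(choose_target(Task()))
```

A real system would need a policy for conflicts such as a privacy-sensitive task that also needs frontier reasoning; this sketch simply sends those to the cloud, which may not be acceptable in practice.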
Apple’s 3-billion-parameter Foundation Model runs on the Neural Engine “for several good reasons: top energy-efficiency to maximize battery life, natively accelerated advanced compression techniques and higher peak throughput.”[s] The Argmax team observed that Apple appears to improve the GPU and the Neural Engine in alternating years, which would make the A20’s Neural Engine the next expected leap.[s]
Strategic Implications
PatSnap’s roadmap analysis states the strategic objective as enabling iPhone to run GPT-3.5-class models entirely on-device.[s] This would represent capability convergence between mobile hardware and the workloads that were typically served from cloud or data-center systems circa 2023.
Jon Peddie Research frames the business logic: “Apple is putting its investment behind AI at the edge. And iPhones, iPads, and watches are the edge, where Apple’s revenue is currently residing.”[s]
The Apple10 GPU architecture enables “graphics and ML kernels to share execution resources and memory bandwidth while developers work with a unified programming model.”[s] This integration reduces context-switch penalties when scheduling tasks across CPU, Neural Engine, and GPU.
Hardware builds capacity; software determines utilization. The A20 chip rumors describe architectural improvements; which Apple Intelligence features ship in iOS 27 will determine whether users experience those improvements.[s] Researchers like Yann LeCun have proposed alternative AI architectures that challenge whether transformer-based approaches will remain dominant, but Apple’s current hardware roadmap is optimized for the transformer inference workloads that define the present moment.



