How TPUs Redefined the Hardware of Intelligence
A fresh dive into Google’s TPUs, AI-native chips built for speed, scale, and energy-efficient deep learning
This post may sound a bit strange, perhaps even radically different from the topics we've explored so far in this publication.
Over the past few years, my work has danced across the abstractions of distributed systems, deep learning architectures, and edge deployment strategies.
But more recently, I’ve begun to descend the stack, all the way down, into the physical substrate of computation: silicon, interconnect fabrics, and compiler-hardware co-design.
There’s something quietly elegant (I have to say it, almost poetic) about tracing a matrix multiplication down to its most primitive form: a wave of signals propagating through a grid of logic gates.
High-level abstractions, like layers, tensors, gradients, are ultimately grounded in patterns of electrons and etched transistors. And that realization changed the way I approached artificial intelligence.
Lately, my journey has taken a sharp turn into the foundational questions of intelligence itself, biological or artificial.
As I delved into the mechanics of reasoning systems, non-biological agents, and the cognitive architecture of so-called “conscious systems,” I found myself studying an eclectic set of subjects: from dataflow graphs and learning theory to the algebraic underpinnings of computation.
That journey began, intellectually, with the building blocks: Boolean algebra, combinatorial logic, and λ-calculus.
Then it expanded into continuous mathematics: manifolds, Riemannian geometry, tensor fields, and complex differential equations were my norm for quite some time.
Then I circled back into the discrete domains of group theory, abstract algebra, and automata. My goal wasn’t merely academic: it was epistemological. I wanted to understand the architectures of thought, artificial or otherwise.
But along that path, I encountered a blind spot; a massive, physical blind spot that had been sitting beneath the software abstractions I’d studied for years.
That missing piece was obviously hardware.
As a software-centric developer, I had long avoided diving into low-level architecture. Hardware felt opaque, slow to evolve, too “close to the metal.”
But it slowly became clear to me that it is hardware, not software, that ultimately shapes the boundaries of what software can express and optimize.
Instruction sets, memory hierarchies, cache coherence protocols, and interconnect topology impose structural constraints that directly affect learning efficiency, scaling, and even model architecture.
So I started reading more and more. Papers, datasheets, architecture overviews and so on. I followed obscure Stack Overflow threads trying to understand warp scheduling in CUDA and memory bank conflicts.
I even reached out to engineers working in compilers and chip design just to grasp concepts like register tiling, kernel fusion, and shared memory bank layout.
Then I stumbled across something very different. Something radical.
A fundamentally new approach to AI hardware: the Tensor Processing Unit (TPU).
Designed at Google starting in 2013 (and first deployed in 2015), the TPU was built not as a general-purpose processor, but as a domain-specific accelerator: custom silicon optimized for the high-throughput matrix multiplications at the heart of modern machine learning.
The architecture was co-designed alongside the XLA (Accelerated Linear Algebra) compiler stack to eliminate the inefficiencies of traditional memory hierarchies and enable tightly scheduled, deterministic execution.
Though the architecture was only publicly detailed in the 2017 ISCA paper, it was the brilliance of people like Norm Jouppi, Cliff Young, David Patterson, Urs Hölzle, and others, drawing from decades of VLSI design and parallel architecture research, that made TPUs what they are today.
One of my secret heroes, Chris Lattner, deserves immense credit for MLIR and compiler infrastructure, though he wasn't directly involved in the original TPU design. That said, his LLVM and Swift contributions helped shape the modern compiler ecosystems that AI now depends on.
Back in 2018, I was vaguely familiar with TPUs. I knew they were being used by Google for things like image classification and primitive speech models. But at the time, my focus was elsewhere: economics, then software, then distributed systems. I never took the time to study the architecture seriously.
That changed a lot recently.
What I found was a marvel of architectural clarity; a chip designed from the ground up for AI workloads, with hardware and software in lockstep.
Systolic arrays, massive on-chip scratchpads, compiler-directed memory scheduling, and ultra-low-latency interconnects, all wrapped in a philosophy that prioritizes deterministic throughput over general-purpose flexibility.
That’s where this article begins.
So, Why TPUs?
In 2006, nearly two decades ago, Google quietly began exploring the idea of specialized hardware to accelerate their growing internal workloads.
At the time, CPUs still reigned supreme. GPUs were primarily used for graphics and scientific simulations, and most ASICs (Application-Specific Integrated Circuits) were reserved for narrow domains like networking, signal processing, or encryption.
The idea of designing custom silicon for machine learning wasn’t just fringe: it was seen as premature and risky.
So they shelved it.
For the next several years, Google’s infrastructure relied on CPUs with general-purpose server architectures. It worked… until it didn’t.
The Deep Learning Crisis (and Opportunity)
By 2013, a quiet revolution was unfolding in the world of AI, one that had been brewing in academic labs and late-night code sprints for years.
At the heart of this shift was a reawakening of an old idea: neural networks.
It wasn’t a new concept. Neural nets had been proposed decades earlier, but for most of their history they were treated as promising but impractical curiosities: computationally expensive and rarely outperforming hand-engineered features.
Then came Geoffrey Hinton, Yann LeCun, and a bold generation of researchers who believed otherwise.
The dam finally broke in 2012.
That year, Hinton’s team from the University of Toronto, including Ilya Sutskever and Alex Krizhevsky, unveiled AlexNet, a deep convolutional neural network that blew away the competition in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
Their model didn’t just win by a narrow margin; it nearly halved the top-5 error rate, showing the world that deep learning wasn’t just viable: it was transformational.
“The future is going to be neural networks.”
— Geoffrey Hinton, 2012
AlexNet ran on two NVIDIA GTX 580 GPUs, using data-parallel training to distribute the model.
It was crude by today’s standards, but it revealed a latent truth: if you gave neural networks enough data and enough compute, they would learn. And often, they would learn better than anything else.
This realization exploded across industry. Google, already sitting on the world’s largest datasets, began deploying deep neural networks (DNNs) across its product stack.
The first major inflection point came with Google Voice Search, which transitioned from statistical models to DNN-based acoustic models. Accuracy shot up, latency dropped, and user engagement skyrocketed.
But there was a catch.
The Inference Wall
Google’s internal projections told a startling story: if they were to fully embrace deep learning for inference across products, their datacenter compute demand would increase by a factor of 10.
That wasn’t a metaphor. That was a literal forecast. With traditional CPU-based infrastructure, running large-scale inference for neural networks would have broken their operational model, demanding vastly more power, rack space, and cooling, and threatening the economic sustainability of their infrastructure.
“It was clear that if we didn't do something, we'd have to build new datacenters just to handle the compute”
— Norman Jouppi, TPU chief architect
It was here, at the edge of crisis, that a new idea took root: what if we could rethink the hardware itself?
Instead of asking general-purpose chips to keep up with increasingly specialized workloads, what if Google built its own chip? A chip engineered not for flexibility, but for one job only: accelerating deep learning.
A Clean-Slate Design for AI
That idea became the Tensor Processing Unit (TPU): a custom Application-Specific Integrated Circuit (ASIC) designed specifically for machine learning inference.
The first version, TPUv1, quietly went into production in 2015. It didn’t support floating-point math. It couldn’t train models. It didn’t need to. All it needed to do was execute inference tasks for models like Google Translate, Photos, and Search, and do it faster and more energy-efficiently than any CPU or GPU could.
At its core, the TPU was built around a radical assumption: that most of deep learning’s computational burden lies in dense and fairly repetitive linear algebra, especially matrix multiplications and convolutions, and that those operations could be executed orders of magnitude more efficiently by tailoring the hardware around them.
But this wasn’t just about silicon.
TPUs were part of a philosophical bet; a software-hardware co-design that required coordination between chip architects, ML researchers, compiler engineers, and systems designers.
It was the realization of three interlocking ideas:
Most ML workloads can be expressed as deterministic, high-throughput matrix operations.
These operations can be statically scheduled and compiled ahead-of-time to eliminate costly memory access unpredictability.
With tight integration between software (e.g. the XLA compiler) and hardware, we can eliminate general-purpose inefficiencies and achieve massive gains in speed and energy use.
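To make the first of those ideas concrete, here is a minimal sketch (illustrative names and shapes, not Google's code) of a transformer-style feed-forward block: essentially all of its arithmetic is two dense matmuls, with a cheap elementwise nonlinearity in between.

```python
import jax.numpy as jnp

def feed_forward(x, W1, b1, W2, b2):
    # Two dense matmuls dominate the FLOP count; the ReLU is a cheap
    # elementwise pass. With fixed shapes, the whole graph is static.
    h = jnp.maximum(x @ W1 + b1, 0.0)   # (batch, d_model) @ (d_model, d_ff)
    return h @ W2 + b2                  # (batch, d_ff) @ (d_ff, d_model)
```

Because the shapes are known in advance, a compiler can schedule every one of those operations before a single batch of data arrives, which is exactly the bet the TPU makes.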
As Jouppi later explained in his seminal ISCA paper, TPUs achieved roughly 15–30x higher inference performance, and 30–80x better performance per watt, than contemporary CPUs and GPUs.
And the wins weren’t just in energy. They were in predictability, deterministic execution, minimal cache misses, and software-controlled data flow.
"When you know the dataflow in advance, you don’t need to waste energy guessing."
— Google TPU Engineering Team
More Than an Accelerator
So, you may be wondering, why TPUs?
Because they are more than accelerators. They are an architectural response to a shift in computational gravity. For decades, general-purpose computing was dominant.
But in a world where models like BERT, PaLM, and Gemini demand exaflop-scale operations, the general-purpose model is no longer sufficient.
TPUs are a statement: that AI is not just a workload: it’s a computational regime. One that deserves (and demands) its own silicon.
They are the physical embodiment of a worldview, one where matrix multiplication isn’t a primitive, but a first principle.
A place where layers of abstraction, from neural networks to XLA graphs to systolic arrays, coalesce into a vertically integrated stack of intelligence.
Where every watt is accounted for, every memory access optimized, and every chip is built not to support a thousand use cases, but one: machine learning.
And as this article will show, that narrow focus is what enables TPUs to scale. From chips to trays, from pods to superpods, Google’s TPU infrastructure is not just an engineering marvel: it’s a vision of what compute looks like when the architecture is shaped entirely around learning.
TPUs, From the Ground Up
By 2015, deep learning had quietly broken through Google’s infrastructure. Voice search had migrated to deep neural networks, and projections warned of a looming compute crisis.
Serving ML inference from general-purpose CPUs would blow up the datacenter power budget. Even GPUs, flexible but energy-hungry, weren’t enough.
So Google built something different.
It was time to say hi to TPUv1, a chip purpose-built to accelerate inference, not with flexibility, but with precision-targeted throughput. It wasn’t designed to do everything fast. Just the one thing that mattered most: matrix multiplication.
“We weren’t aiming to build a general-purpose processor. We were optimizing for the one thing that mattered.”
— Norman Jouppi, Google Distinguished Engineer
Under the hood:
Process: 28 nm CMOS
Die size: ~331 mm²
Clock: 700 MHz
Power: ~28–40 W TDP
Compute core: 256×256 8-bit MAC systolic array (65,536 units total)
On-chip memory: ~28 MiB of software-managed SRAM (a 24 MiB unified buffer plus 4 MiB of accumulators)
Off-chip memory: 8 GiB DDR3, 34 GB/s bandwidth
Interface: PCIe 3.0 x16; host issues commands to a CISC-style instruction stream
Everything in TPUv1 was optimized for determinism and efficiency.
There was no cache, no speculative execution, no branch prediction, no guessing. Every data movement and instruction was known ahead of time, scheduled deterministically.
This made it not only fast but predictably fast, a crucial feature for low-latency inference in production systems.
Why systolic arrays?
Because they’re simple, dense, and built for predictable dataflow. A systolic array is a grid of multiply-accumulate units (MACs) that pass data in rhythmic pulses—like a heart pumping compute.
During matrix multiplies, once weights and activations enter the array, no further memory accesses are needed. Just synchronized waves of computation, one clock tick at a time.
“The systolic array lets you push data in one side and have answers emerge from the other; no caches, no stalls, no drama.”
— David Patterson, co-architect of RISC and TPUv1 contributor
And it worked.
TPUv1 achieved up to 92 TOPS (8-bit) at under 40 watts. Compared to NVIDIA’s K80 GPU, TPUv1 delivered 15–30× better inference throughput, and 30–80× better performance per watt.
But it had one key bottleneck: memory. The 34 GB/s DDR3 channel couldn’t keep up with the array’s hunger for weights. The systolic array often sat idle, waiting for data. This was a hardware version of the classic software starvation problem.
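A back-of-the-envelope roofline calculation, using only the figures quoted above, shows how severe that starvation was: to keep the array busy, every byte fetched from DDR3 would have to feed thousands of operations, far more reuse than a typical inference layer provides.

```python
# Back-of-the-envelope roofline from the TPUv1 figures quoted above.
macs = 256 * 256                  # 65,536 multiply-accumulate cells
clock_hz = 700e6                  # 700 MHz
peak_ops = 2 * macs * clock_hz    # 1 multiply + 1 add per MAC per cycle, ~92 TOPS

ddr3_bytes_per_s = 34e9           # off-chip DDR3 bandwidth

# Ops every fetched byte must feed for the array to stay busy (the roofline ridge point).
ridge = peak_ops / ddr3_bytes_per_s
print(f"peak ~{peak_ops / 1e12:.0f} TOPS, ridge point ~{ridge:.0f} ops/byte")

# That is roughly 2,700 operations per byte. A memory-bound layer (a large
# matrix-vector product in an MLP or LSTM, say) reuses each weight byte only
# a handful of times, so the array stalls waiting on DDR3, as described above.
```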
TPUv1 was a beautiful, elegant idea. But it needed a bigger, more complete stage.
TPUv2: Scaling to Training (2017)
If TPUv1 was a focused blade, TPUv2 was much more like a multi-engine rocket.
With the rise of LSTMs, CNNs, and early transformers, Google’s models weren’t just getting wider: they were training for weeks on massive corpora. Inference was no longer the bottleneck. Training was.
So in 2017, Google released TPUv2:
A new chip, nearly twice as dense, an order of magnitude more capable, and, crucially, able to train models.
Key architectural upgrades:
Process: 16 nm
Memory: 16 GiB HBM per chip (~600 GB/s bandwidth)
Cores per chip: 2 TensorCores
Systolic array: 128×128 per core (still MAC-based, but tuned for bfloat16)
ISA: VLIW-style micro-ops (bundle up to 8 ops per cycle)
Precision: bfloat16 native support
On-chip memory: VMEM + SMEM buffers for explicit scratchpad control
Google moved from DDR3 to High Bandwidth Memory (HBM), which was truly a game-changer. The bandwidth jump—from 34 GB/s to ~600 GB/s—unleashed the systolic array's full potential. No more starvation. Now, data could keep up with compute.
The bfloat16 choice is also telling. Unlike FP16, which loses range, bfloat16 retains the same exponent width as FP32, making it far more numerically stable for deep learning workloads. You sacrifice precision, not scale.
“Bfloat16 lets you halve the storage and double the throughput—without losing model accuracy.”
— Jeff Dean, Google SVP of Research
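A quick way to see the trade-off for yourself, as a minimal sketch runnable on any JAX backend (not just TPU): bfloat16 keeps float32's eight exponent bits, so it covers the same dynamic range, while float16 overflows at values deep learning encounters routinely.

```python
import jax.numpy as jnp

# Same dynamic range as float32 (8 exponent bits), but only 7 stored mantissa bits.
for dt in (jnp.float32, jnp.bfloat16, jnp.float16):
    info = jnp.finfo(dt)
    print(f"{dt.__name__:>8}: max ~ {float(info.max):.3g}, {info.nmant} mantissa bits")

# float16 tops out near 65,504, so a perfectly ordinary activation value overflows:
print(jnp.asarray(70000.0, dtype=jnp.float16))   # inf
print(jnp.asarray(70000.0, dtype=jnp.bfloat16))  # ~70144: representable, just coarser
```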
But TPUv2 wasn’t just a chip. It was a whole intricate system.
Google started wiring TPUs into pods: 256 chips interconnected with Inter-Chip Interconnect (ICI) links, forming mesh topologies that enabled fast all-reduce and parameter syncing. Each pod could deliver 11.5 PFLOPS of training compute.
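From the software side, "parameter syncing" looks roughly like the toy data-parallel step below, a sketch built on public JAX APIs rather than Google's production training code; the loss function, shapes, and learning rate are invented for illustration. Each replica computes gradients on its shard of the batch, and a single lax.pmean performs the all-reduce over the interconnect.

```python
from functools import partial
import jax
import jax.numpy as jnp

# Toy loss; params, shapes, and the learning rate are made up for illustration.
def loss_fn(params, batch):
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)

@partial(jax.pmap, axis_name="devices")   # one replica per local TPU core
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # The all-reduce: every replica averages its gradients over the interconnect.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return params - 0.01 * grads

n = jax.local_device_count()
params = jnp.broadcast_to(jnp.zeros(16), (n, 16))    # replicate parameters
batch = (jnp.ones((n, 8, 16)), jnp.ones((n, 8)))     # shard the batch's leading axis
params = train_step(params, batch)
```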
And it wasn’t just about hardware. TPUv2 was built in lockstep with XLA, Google’s Accelerated Linear Algebra compiler. XLA performed ahead-of-time graph compilation, transforming static TensorFlow (or JAX) graphs into memory-optimized binary blobs tailored for TPU hardware.
The compiler did pretty much everything:
Data layout. Memory scheduling. Instruction selection. Communication patterns. The hardware and software were not just compatible: they were co-evolved.
Why This Journey Matters
From TPUv1 to TPUv2, we see a shift in philosophy—from inference to training, from micro-efficiency to distributed systems. But beneath the changes, one truth remains constant:
If you can make matrix multiplication faster, you can make machine learning faster.
Google didn’t build TPUs just to save power. They built them to build a future; a future where hardware doesn't chase software, but where software flows from the shape of the silicon.
And as we climb from chips to racks to superpods, we’re not just scaling compute.
We’re scaling thought itself.
TPUv4 Chip: The Heart of Compute
Let’s zoom in, way in, down to the beating heart of Google’s AI infrastructure: the TPUv4 chip.
Not a metaphor, but an actual sliver of silicon that lives inside a metal tray in a climate-controlled rack, running models that shape everything from search to translation to large language models.
Each TPUv4 chip is composed of two TensorCores, and each of those TensorCores is effectively a self-contained engine for neural computation.
Think of them as vertically integrated processors, where matrix algebra, memory, control flow, and synchronization are all co-located, compact, and streamlined.
The Matrix Multiply Unit (MXU)
At the center sits the MXU, a 128×128 systolic array of multiply-accumulate units: 16,384 individual arithmetic cells, pulsing in lockstep. This is where the heavy lifting happens.
Dense matmuls, convolutions, batched dot products: all of it flows through this array, clocked and orchestrated with brutal regularity.
Systolic means rhythm. Data enters from one side, weights from another, and results propagate diagonally like a wavefront.
The beauty of this design is that no memory is accessed during computation: intermediate values simply move across the array, cell by cell, minimizing data movement and maximizing arithmetic intensity.
This is a machine built not to think, but to multiply.
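To make that wavefront picture concrete, here is a toy, cycle-by-cycle simulation in plain Python. It models an output-stationary variant for simplicity (the real MXU pre-loads weights into the array), but the key property is the same: once operands enter, each cell only ever talks to its neighbors, and no memory is touched while the multiply is in flight.

```python
def systolic_matmul(A, B):
    """Toy cycle-by-cycle simulation of an output-stationary systolic array.

    Rows of A stream in from the left, columns of B stream in from the top,
    and each cell only ever reads its two neighbours: no memory accesses
    happen while the multiplication is in flight.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0.0] * m for _ in range(n)]     # one accumulator per processing element
    a_reg = [[0.0] * m for _ in range(n)]   # value each PE forwards to its right neighbour
    b_reg = [[0.0] * m for _ in range(n)]   # value each PE forwards to the PE below it

    for t in range(n + m + k - 2):          # cycles until the last wavefront drains
        for i in reversed(range(n)):        # update bottom-right first so registers
            for j in reversed(range(m)):    # still hold last cycle's values when read
                a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < k else 0.0)
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < k else 0.0)
                acc[i][j] += a_in * b_in    # the only arithmetic a cell ever does
                a_reg[i][j], b_reg[i][j] = a_in, b_in
    return acc

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
assert systolic_matmul(A, B) == [[58, 64], [139, 154]]
```

Scale that 2x2 grid up to 128x128 cells pulsing at full clock rate, and you have the MXU's throughput without a single cache lookup in the inner loop.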
The Vector Processing Unit (VPU)
Alongside the MXU is the VPU, less famous, but just as crucial. If the MXU handles the "big math," the VPU does everything else: elementwise operations, nonlinearities like ReLU, add/mul/fused ops, layer normalization, you name it.
In modern neural networks, most operations aren’t giant matrix multiplies. They’re small, repetitive, scalar or vector-wise functions applied across tens of millions of activations. The VPU exists to eat through these operations at speed, feeding or cleaning up behind the MXU.
It’s also one of the reasons TPUs can handle end-to-end models, because even the “non-matrix” parts are accounted for in silicon.
The On-Chip Memory: VMEM and SMEM
This is where TPUs differ radically from GPUs.
Instead of large cache hierarchies and speculative prefetching, TPUv4 relies on software-managed scratchpad memory, in two flavors:
VMEM (Vector Memory): ~32 MiB per core, dedicated to tensors. Think of it as a fast-access staging ground for inputs and outputs, mapped directly by the compiler.
SMEM (Scalar Memory): ~10 MiB for program control (loop counters, scalar values, control signals, state).
Together, they create a deterministic memory model. Nothing is left to chance. No cache misses. No branching chaos. The compiler knows exactly where each tensor will live and for how long.
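One way to see what that determinism demands of the compiler, as a rough sketch that uses the ~32 MiB VMEM figure quoted above with made-up tile shapes, and ignores double-buffering and the real compiler's layout rules: matmul tiles must be sized so their operands and outputs fit in VMEM at the same time.

```python
VMEM_BYTES = 32 * 2**20   # ~32 MiB of vector scratchpad per core (figure quoted above)
BF16 = 2                  # bytes per bfloat16 element

def tile_footprint(m, k, n, elem_bytes=BF16):
    """Bytes needed to hold an (m,k) x (k,n) matmul tile plus its (m,n) output."""
    return (m * k + k * n + m * n) * elem_bytes

for m, k, n in [(256, 8192, 1024), (512, 8192, 2048), (1024, 8192, 4096)]:
    needed = tile_footprint(m, k, n)
    verdict = "fits" if needed <= VMEM_BYTES else "does NOT fit"
    print(f"{m}x{k} @ {k}x{n}: {needed / 2**20:5.1f} MiB -> {verdict} in VMEM")
```

The compiler does this kind of accounting for every operation in the graph, ahead of time, so that at runtime there is never a surprise fetch.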
Simplicity Over Flexibility
Compared to a GPU, a TPU core might look underpowered. Fewer cores. No threads. No warp scheduling. No shared memory bank juggling.
But that’s the point.
Where GPUs aim to be general-purpose accelerators, TPUs are purpose-built engines. Every block inside the chip exists to serve one assumption: that deep learning is dominated by dense linear algebra, and that we can compile those operations ahead of time.
This means TPUs don’t have to improvise at runtime. They execute a known choreography, one that’s been statically unrolled and scheduled by the XLA compiler before the first bit hits the chip.
The Memory Inversion
There’s one subtle but profound inversion in TPU design: the ratio of on-chip memory to high-bandwidth memory (HBM).
GPUs typically have massive HBM capacity (e.g., 80–96 GiB) and small L1/L2 caches. They assume cache misses and rely on aggressive runtime management.
TPUs, especially v4, flip this. They keep large scratchpads on-chip and use HBM more sparingly, because they trust that the compiler will prevent the need for dynamic memory access.
Why does this matter?
Because in modern semiconductor physics, moving data costs more than computing on it. The TPUv4 chip design leans into this: reuse data locally, stream it linearly, avoid external memory like the plague.
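To put rough numbers on that claim, the sketch below uses per-operation energy estimates frequently cited from Mark Horowitz's 2014 ISSCC talk (roughly 45 nm figures, order of magnitude only): a single off-chip DRAM access costs thousands of times more energy than the 8-bit multiply-accumulate it feeds.

```python
# Approximate energy per operation in picojoules (~45 nm, order of magnitude only).
ENERGY_PJ = {
    "8-bit integer add":         0.03,
    "8-bit integer multiply":    0.2,
    "32-bit on-chip SRAM read":  5.0,    # small local scratchpad
    "32-bit off-chip DRAM read": 640.0,
}

mac = ENERGY_PJ["8-bit integer multiply"] + ENERGY_PJ["8-bit integer add"]
for mem in ("32-bit on-chip SRAM read", "32-bit off-chip DRAM read"):
    print(f"{mem}: ~{ENERGY_PJ[mem] / mac:,.0f}x the cost of an 8-bit MAC")

# The DRAM read comes out at a few thousand MACs' worth of energy, which is why
# the TPU keeps data in on-chip scratchpads and streams it through the array.
```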
A Purpose-Built Engine
The result is a chip that may look simple on a block diagram, but hides enormous architectural intent.
It’s not trying to run every program. It’s trying to run one class of program, very, very well. And in doing so, it lays the foundation for scaling (from one chip to thousands) because every assumption it makes at the micro level translates into deterministic behavior at the macro level.
So when you call jax.jit() and dispatch a function to a TPU, what actually happens isn’t just offloading to hardware. You’re passing your graph to a runtime that knows exactly how to lay it out across MXUs, VPUs, VMEM, and SMEM, and to do so with zero guesswork.
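You can watch that pipeline happen with public JAX APIs; the function and shapes below are arbitrary, and this is a sketch of the workflow rather than anything TPU-specific in the code itself. jit traces the function once into a static graph, lower emits the StableHLO that XLA will schedule, and compile returns the fixed executable, all before any data is processed.

```python
import jax
import jax.numpy as jnp

def block(x, w):
    # A static-shape matmul + nonlinearity: exactly the kind of graph XLA expects.
    return jax.nn.relu(x @ w)

x = jnp.ones((128, 512), jnp.bfloat16)
w = jnp.ones((512, 256), jnp.bfloat16)

lowered = jax.jit(block).lower(x, w)   # trace once: shapes and dataflow are now fixed
print(lowered.as_text()[:400])         # the StableHLO handed to XLA
compiled = lowered.compile()           # ahead-of-time schedule for the target backend
print(compiled.cost_analysis())        # XLA's own estimate of FLOPs and bytes moved
y = compiled(x, w)                     # runs the pre-planned program; no runtime guessing
```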
You’re not just programming a chip. You’re speaking in the language of architecture.
And that’s what makes the TPUv4 chip so fascinating. Not its FLOPs or frequency or transistor count, but the conviction in its design. The idea that if you give up generality, and build for a well-defined mathematical world, you can unlock massive efficiency.
The TPUv4 isn’t a general-purpose machine. It’s a neural arithmetic engine: custom-built for a world where AI models are the dominant workloads and every microjoule counts.
And the more you understand what lives inside it, the more you realize: this isn’t just clever engineering.
It’s more like a statement.
Foundational Hardware for Foundational Models
TPUs represent one of the most ambitious hardware projects of the deep learning era. But their true innovation isn’t just the MXU or the 3D torus. It’s the philosophy:
“Make the hardware simple, and make the compiler do the thinking.”
In a world moving fast toward trillion-parameter models and energy-conscious compute, this philosophy feels not just relevant, but urgent.
As I continue diving into architectures, ranging from edge to cloud, from inference to training, I’m struck by one insight: great systems don't just scale well. They reflect the assumptions and ambitions of the people who built them.
And TPUs are nothing if not opinionated.
Thanks for reading. If you enjoyed this, feel free to share or subscribe. I’ll be diving into more system architectures, compiler designs, and AI hardware in future posts.