Why Matrix Multiplication Reveals the Limits of Optimization
How a simple operation taught us the boundaries of programming language efficiency and the hidden beauty of mathematical structure.
One Problem, Two Worlds
Practically every programmer on earth, at some point, has to multiply two matrices.
It’s the bread and butter of data science, the heartbeat of graphics rendering, the hidden skeleton of neural networks.
Behind every machine learning model or 3D engine lies a massive, tireless loop that multiplies and sums numbers; we’re talking about billions of them, every single second.
But here’s the twist: not all matrix multiplications are created equal.
In the world of programming, how fast your computer multiplies matrices depends not only on your CPU speed or the amount of RAM you have: it depends on how you express the operation, and what abstractions your programming language hides or exposes.
So, let’s tell a story. A story of two programmers, we’ll call them Clara and Devin, each trying to write the fastest possible matrix multiplication.
One fully believes in the power of low-level control.
The other, in the raw intelligence of modern compilers.
The Two Compilers
The busy city of Computeopolis is buzzing with excitement. The University has announced a brand new challenge: implement a matrix multiplication that multiplies two N×N matrices as fast as possible.
The only rule is that each competitor must use a different programming language.
Clara, a systems engineer with an (almost) obsessive love for control, picks C. She writes every loop by hand, ordering the indices carefully so that data sits nicely in the CPU cache.
She aligns memory manually, unrolls loops, and sprinkles restrict keywords like fairy dust to help the compiler vectorize her code. Bare-metal elegance: every cycle counts.
Devin, on the other hand, is a competitive computer science researcher who believes in the power of abstraction. He picks Python with NumPy, writing just a single line:
C = A @ B
No loops here. No trace of pointers. No micro-optimizations. Just pure math as it’s meant to be written.
The audience laughs. “You can’t be serious,” they say. “That’s not even real programming.” But Devin starts to smile, and he has plenty of reasons to do so. He thinks he knows something they don’t.
The Benchmark Begins
The tests start with small matrices: tiny 10×10 grids.
Clara’s C code blazes through them. The compiler’s optimization flags are set to -O3, and every instruction is pure assembly artistry. Her code runs hundreds of times faster than Devin’s at this point.
In the Python version, the interpreter overhead dominates the scene. NumPy barely has time to spin up before the test finishes. A performance catastrophe to any observer watching from the outside.
But as the matrix size grows (100×100, 1000×1000, 5000×5000) something strange happens.
Clara’s performance curve flattens out, while Devin’s starts to climb dramatically. At 10,000×10,000, the race isn’t even close anymore: Devin’s NumPy version finishes in seconds, while Clara’s code takes a few minutes. The crowd is stunned.
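(If you want to stage a similar race yourself, here is a minimal sketch in Python; the pure-Python triple loop stands in for a straightforward hand-written implementation, it is not Clara’s C code, and the exact timings depend on your machine and on the BLAS backend your NumPy build uses.)

import time
import numpy as np

def naive_matmul(A, B):
    """Schoolbook O(N^3) triple loop, a stand-in for a simple hand-written kernel."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

for n in (10, 100, 200):  # keep sizes small: the inner loops run in pure Python
    A, B = np.random.rand(n, n), np.random.rand(n, n)

    t0 = time.perf_counter()
    naive_matmul(A, B)
    t_loop = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = A @ B  # dispatches to whatever BLAS backend NumPy was built with
    t_blas = time.perf_counter() - t0

    print(f"N={n:4d}   loops: {t_loop:9.4f}s   A @ B: {t_blas:9.6f}s")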
“How is that even possible?” asks the judge. “Python should be dozens of times slower than C!”
The answer actually lies beneath the surface, in the invisible layers between math and machine.
The Hidden Layers of Efficiency
When Devin writes C = A @ B, he’s not really asking Python to do the work. That single line is a front: a polite request passed down to highly optimized native libraries like BLAS (Basic Linear Algebra Subprograms) or Intel MKL, written in Fortran, C, and hand-tuned assembly, refined for decades by experts.
In other words, Devin’s Python program is just an elegant wrapper around Clara’s nightmare of hand-written optimization, except that it leverages the work of hundreds of performance engineers, all embedded in the NumPy library.
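(You can peek behind that curtain yourself: NumPy reports which native backend it was built against; the exact output depends on your installation.)

import numpy as np

# Prints build information, including the BLAS/LAPACK libraries
# (e.g. OpenBLAS or MKL) that A @ B ultimately delegates to.
np.show_config()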
Devin’s one-liner didn’t beat Clara’s code because Python is faster than C.
It won because the efficiency of a program is about more than the efficiency of its language.
It’s the efficiency of translation: from your intent, to the layers of abstraction, to the machine instructions that actually execute.
This brings us to the real question:
When we talk about the “speed” of a programming language, what do we actually mean?
Measuring the Unmeasurable
At first glance, we might think we can measure programming language speed directly: run the same algorithm in two languages and see which one finishes first.
But as we just saw, this approach is actually misleading. A programming language isn’t just a performance surface: it’s a stack of abstractions, each adding overhead but also expressiveness.
The higher you go, the more the machine understands what you mean, not just how you want it done.
Let’s now take some time to formalize this intuition.
Suppose we define a function T_L(N): the time it takes a program written in language L to multiply two N×N matrices.
Part of that time is a fixed overhead γ_L that captures the cost of the language and its runtime: startup, compilation or interpretation, garbage collection, and so on.
But the dominant term comes from the algorithm’s inherent complexity, typically O(N^3) for naive matrix multiplication, scaled by a per-operation constant α_L that reflects how efficiently the language translates into machine instructions.
So we can express the runtime roughly as:
T_L(N) = α_L·N³ + β_L·N² + γ_L
At small N, γ_L and β_L dominate, so C or another language like Rust will always crush Python or JavaScript.
But as N grows, the cubic term takes over: what matters then is how small the leading constant α_L is, and, above all, whether you can lower the exponent itself.
That’s when abstraction, and better algorithms, begin to win.
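To make the crossover concrete, here is a toy sketch of that model. The α, β, γ values below are invented purely for illustration, they are not measurements of any real language: one profile has almost no startup cost but a larger per-operation cost, the other pays heavy overhead up front but hands the N³ work to a tuned native kernel.

def runtime(N, alpha, beta, gamma):
    # T_L(N) = alpha*N^3 + beta*N^2 + gamma  (the illustrative cost model above)
    return alpha * N**3 + beta * N**2 + gamma

# Hypothetical constants, chosen only to show the shape of the curves.
low_level  = dict(alpha=2e-9,  beta=1e-9, gamma=1e-4)
high_level = dict(alpha=2e-10, beta=5e-9, gamma=5e-2)

for N in (10, 100, 1_000, 10_000):
    t_low, t_high = runtime(N, **low_level), runtime(N, **high_level)
    print(f"N={N:6d}   low-level: {t_low:10.4f}s   high-level: {t_high:10.4f}s")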
When Mathematics Beats Optimization
Now let’s imagine a third competitor: let’s call her Elena, a very smart mathematician who doesn’t even bother coding.
Instead, she walks up to the stage and asks, “Why are you multiplying matrices in O(N³) time at all?”
Everyone stares.
Elena then introduces the audience to Strassen’s algorithm, a divide-and-conquer approach that multiplies matrices in roughly O(N^2.81) time by cleverly eliminating redundant operations.
Her insight? You don’t always need to multiply every element. By reusing intermediate results, you can reduce the number of recursive multiplications from 8 to 7 per block.
That single saved multiplication, applied recursively, changes everything.
When the tests run, Elena’s implementation (running on top of the same optimized BLAS routines) leaves both Clara and Devin in the dust. Her code doesn’t just rely on fast execution: it relies on fewer operations.
In other words:
Even the best compiler in the world can’t beat better mathematics (or better mathematicians).
The Evolution of Matrix Multiplication
Strassen’s insight in 1969 was more than a clever trick: it was a complete revelation.
Until then, most computer scientists believed the O(N^3) bound for matrix multiplication was fundamental and very difficult to optimize. But Strassen showed that structure can beat brute force.
Strassen’s Algorithm (1969)
Let’s recall the naive approach: for two N×N matrices A and B, each element c_ij in the result matrix C is given by:
c_ij = Σ_k a_ik · b_kj,  for k = 1, …, N
That’s N^3 multiplications and roughly N^3 − N^2 additions.
Strassen realized that by cleverly regrouping terms, you could reduce eight recursive multiplications to seven when dividing the matrix into 2×2 blocks.
He defined seven intermediary products:
M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)
Then combined them to reconstruct C:
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
This reduced the computational complexity to:
O(N^(log₂ 7)) ≈ O(N^2.81)
It may not sound dramatic, but for large N the savings compound: for matrices a million rows wide, the gap between N^3 and N^2.81 is already worth roughly an order of magnitude. And it cracked open the idea that maybe, just maybe, O(N^3) wasn’t fundamental after all.
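Here is a minimal recursive sketch of the idea in NumPy, assuming for simplicity that N is a power of two; real implementations switch back to the standard kernel below some cutoff size, exactly as this sketch does.

import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen multiplication for square matrices whose size is a power of two."""
    n = A.shape[0]
    if n <= cutoff:            # below the cutoff, recursion overhead outweighs the savings
        return A @ B

    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]

    # Seven recursive products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11,       cutoff)
    M3 = strassen(A11,       B12 - B22, cutoff)
    M4 = strassen(A22,       B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22,       cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty((n, n))
    C[:k, :k] = M1 + M4 - M5 + M7
    C[:k, k:] = M3 + M5
    C[k:, :k] = M2 + M4
    C[k:, k:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.allclose(strassen(A, B), A @ B))   # True, up to floating-point error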
Coppersmith–Winograd (1987 → 1990 → 2012)
Nearly two decades later, Don Coppersmith and Shmuel Winograd extended Strassen’s dream.
They found new algebraic identities that could express matrix multiplication with even fewer scalar multiplications.
Their early bound reached:
O(N^{2.376})
Later refinements, particularly those by Stothers (2010), Vassilevska Williams (2012), and Le Gall (2014), pushed the exponent slightly lower, to around:
O(N^{2.3728596})
But there was a catch: these algorithms are, in practice, purely theoretical.
They rely on recursive tensor constructions so deep and memory-hungry that their hidden constant factors swallow the savings at any matrix size a real computer could hold.
Still, they established an existence proof: the universe of possible matrix multiplication algorithms is far richer than we ever imagined.
If Strassen cracked the door, Coppersmith and Winograd showed us there’s a gigantic cathedral behind it.
AlphaTensor (DeepMind, 2022)
Fast-forward to the 21st century.
In 2022, DeepMind published AlphaTensor, a reinforcement-learning system that rediscovered and then surpassed known algorithms for matrix multiplication, all automatically.
AlphaTensor treated the problem as a single-player game:
given a 3D tensor encoding the multiplication structure, find a decomposition that minimizes the number of scalar multiplications.
It searched this enormous combinatorial space using deep reinforcement learning, guided by patterns learned from prior solutions.
The result? New algorithms that beat human-designed ones: not just theoretically, but on real hardware.
For example, on NVIDIA V100 GPUs, AlphaTensor found a decomposition that achieved a 10–20% speedup over the best human-derived algorithms for specific matrix sizes.
Its core idea was to represent matrix multiplication as a tensor decomposition:
T = Σ_{r=1}^{R} u_r ⊗ v_r ⊗ w_r
where T is the tensor encoding the multiplication and R is its rank; effectively, R is the number of scalar multiplications.
Minimizing R is equivalent to minimizing the algorithmic complexity.
Strassen, Coppersmith, and Winograd all did this analytically; AlphaTensor did it experimentally, by exploring billions of potential decompositions.
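To see the “rank equals multiplications” idea in action without any machine learning, here is a small sketch, not AlphaTensor’s code, that evaluates a 2×2 product from Strassen’s classic rank-7 factor matrices U, V, W: each of the R = 7 entries of m costs exactly one scalar multiplication.

import numpy as np

# Strassen's rank-7 decomposition of the 2x2 matrix multiplication tensor.
# Row r of U and V defines the product M_r = (U[r] · vec(A)) * (V[r] · vec(B)),
# and W recombines the seven products into vec(C). vec order: (11, 12, 21, 22).
U = np.array([[ 1, 0, 0, 1],
              [ 0, 0, 1, 1],
              [ 1, 0, 0, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [-1, 0, 1, 0],
              [ 0, 1, 0,-1]])
V = np.array([[ 1, 0, 0, 1],
              [ 1, 0, 0, 0],
              [ 0, 1, 0,-1],
              [-1, 0, 1, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [ 0, 0, 1, 1]])
W = np.array([[ 1, 0, 0, 1,-1, 0, 1],
              [ 0, 0, 1, 0, 1, 0, 0],
              [ 0, 1, 0, 1, 0, 0, 0],
              [ 1,-1, 1, 0, 0, 1, 0]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)

m = (U @ A.ravel()) * (V @ B.ravel())   # exactly R = 7 scalar multiplications
C = (W @ m).reshape(2, 2)

print(np.allclose(C, A @ B))            # True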
What This Evolution Teaches Us
The pattern is clear: as mathematics becomes more abstract, our tools for optimization become more powerful, and more general.
Abstraction, paradoxically, is what allows deeper efficiency.
It’s not the enemy of performance; it’s the key to discovering its next frontier.
Abstraction as Amplifier
Here’s the deeper lesson: abstraction doesn’t have to slow things down. Used correctly, it amplifies efficiency.
Devin’s choice of NumPy wasn’t about laziness; it was about leverage.
He stood on top of layers of mathematical and engineering intelligence (BLAS, vectorized memory access, cache-blocking algorithms) all abstracted behind one line of code.
Abstraction lets us encode intent, not procedure.
That’s a profound shift.
When we tell the computer what we want instead of how to do it, we open the door to automatic optimization: to compilers and runtimes that can rearrange, parallelize, and offload our computations to GPUs or distributed systems without us writing a single loop.
This is why the most efficient programming languages today, Julia, Rust, even modern Python with JAX, aren’t necessarily the ones closest to the metal.
They’re the ones that translate mathematical intent into efficient machine behavior.
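A minimal sketch of what that looks like in practice, assuming JAX is installed: the function states the intent, and the XLA compiler decides how to tile it, vectorize it, and place it on whatever device (CPU, GPU, or TPU) is available.

import jax
import jax.numpy as jnp

@jax.jit   # XLA compiles the traced computation for the available device
def product(A, B):
    # We state what we want, a matrix product; the compiler decides how.
    return A @ B

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(key_a, (1024, 1024))
B = jax.random.normal(key_b, (1024, 1024))

C = product(A, B)   # the first call compiles; later calls reuse the kernel
print(C.shape)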
The Efficiency Frontier
Let’s pause and think more formally.
There’s an invisible frontier, a boundary between what’s limited by hardware and what’s limited by algorithmic structure.
We can represent this as two components of total runtime:
T(N) = T_algorithm(N) + T_implementation(N)
T_algorithm(N) depends on the mathematical method used: N^3 for the naive approach, N^2.81 for Strassen, or even N^2.37 for modern algorithms in the Coppersmith–Winograd family.
T_implementation(N) depends on compiler efficiency, memory layout, caching, and CPU instruction parallelism.
Optimization at the implementation layer gives you constant-factor speedups.
Optimization at the algorithmic layer changes the exponent itself, an advantage that keeps growing as the problem size grows.
This distinction is not just academic: it’s existential for computing.
A new algorithm changes the shape of the future.
The Intrinsic Limits of Compilers
Why, then, can’t compilers just figure this out automatically?
Why can’t they look at a triple-nested loop and say, “Ah, that’s matrix multiplication, I’ll switch to Strassen’s algorithm”?
Because understanding what a program means is one of the hardest problems in computer science.
It’s the same reason compilers can’t always auto-parallelize loops or optimize arbitrary recursive functions.
Determining what a program does, in the general case, is undecidable, a result that traces back to Alan Turing. No compiler can universally predict the runtime behavior of arbitrary code: it’s the Halting Problem in disguise.
So while compilers can optimize instruction scheduling and memory usage, they can’t discover better mathematics for you.
That leap still belongs to human insight, but increasingly, to machine learning systems trained on patterns of mathematical structure.
The Rise of Learned Optimization
Interestingly, the line between human and machine mathematical insight is starting to blur.
Projects like AlphaTensor (from DeepMind) have already discovered novel, more efficient algorithms for matrix multiplication, ones that outperform even Strassen’s on specific hardware architectures.
These systems use reinforcement learning to explore the space of possible decompositions of matrix multiplication tensors, finding new factorizations that minimize computational cost.
In other words, machines are now learning better math.
This is where abstraction, mathematics, and machine learning converge:
by describing computation in high-level, algebraic terms, we give AI the vocabulary to optimize at the algorithmic level, and not just at the machine code level.
If Clara represented control and Devin represented abstraction, AlphaTensor represents autonomous reasoning: a new kind of compiler that doesn’t just translate intent but discovers new forms of it.
Curse of Dimensionality
While we’re talking about matrices, we can’t ignore their darker side.
As matrix dimensions grow, so do their costs: storage grows quadratically with N, and naive multiplication grows cubically.
This is the curse of dimensionality, the same phenomenon that makes high-dimensional data nearly impossible to process efficiently using naive methods.
For example, a 10,000×10,000 matrix has 100 million elements. At 8 bytes per double-precision number, that’s nearly a gigabyte of memory. A 100,000×100,000 matrix? Roughly 80 gigabytes.
And that’s just to store it.
This is why modern computing has moved toward sparse representations and tensor factorizations: mathematical strategies that exploit structure and redundancy to tame the explosion of data.
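A small sketch with SciPy shows the idea; the density below is hypothetical, chosen only to contrast with the 80-gigabyte dense layout mentioned above.

import numpy as np
from scipy import sparse

n = 100_000
density = 1e-5   # hypothetical: roughly 1 nonzero per 100,000 entries

# A dense n×n matrix of doubles would need about 80 GB; a sparse matrix
# stores only the nonzero entries plus their indices.
A = sparse.random(n, n, density=density, format="csr", dtype=np.float64)
x = np.ones(n)

y = A @ x   # the matrix-vector product touches only the stored nonzeros
mb = (A.data.nbytes + A.indices.nbytes + A.indptr.nbytes) / 1e6
print(f"nonzeros: {A.nnz:,}   memory: about {mb:.1f} MB")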
It’s also why programming languages now integrate linear algebra abstractions directly into their design, because efficiency isn’t just about speed, it’s about representation.
Programming Languages as Algebra
The most profound shift in recent years has been the fusion of mathematics and programming itself.
Languages like Julia treat code as symbolic mathematics: you can define an expression like A * B and the compiler can reason algebraically about it, optimizing across data types, hardware, and even distributed nodes.
This idea (that programming languages are a form of applied algebra) is not new, but it’s finally reaching maturity.
Every time you vectorize a computation, use a matrix library, or employ automatic differentiation, you’re exploiting algebraic structure encoded in the language. The runtime doesn’t just execute your code; it transforms it.
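A tiny sketch of that last point, again assuming JAX: the gradient program below is derived automatically from the algebraic structure of the code, not written by hand.

import jax
import jax.numpy as jnp

def f(x):
    # Plain mathematical code: f(x) = sum(sin(x)^2)
    return jnp.sum(jnp.sin(x) ** 2)

grad_f = jax.grad(f)              # a new program, rewritten from f
x = jnp.linspace(0.0, 1.0, 5)
print(grad_f(x))                  # elementwise 2·sin(x)·cos(x), derived by the framework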
The best modern compilers are, in a sense, algebraic engines, rewriting your program into equivalent but more efficient mathematical forms.
The Moral of the Story
By the end of the Computeopolis challenge, the judges declare all three winners.
Clara, for mastering the art of control.
Devin, for trusting abstraction.
And Elena, for transcending both with better mathematics.
The true frontier of computing lies at the intersection of all three.
The next time someone says “Python is slow”, remember the lesson of the matrix. Languages don’t define performance, layers of meaning do.
A single line of elegant code can mobilize decades of mathematical genius, crossing abstraction layers from Python to C to silicon, running optimized vectorized kernels across thousands of cores, all because you wrote A @ B.
We don’t win by fighting the machine for control. We win by teaching the machine what we mean.
That is the enduring lesson of modern computing:
The closer our code mirrors mathematics, the faster it can become.



