Can AI Truly Reason? Part 1
Our Journey Exploring the Limits of Machine Intelligence, Computing Logic, and the Future of Hybrid AI Systems
Why LLMs Still Can’t Think Like Aristotle—Or Even Plato
Two towering figures from ancient philosophy still shape how we model our understanding of the world (and of course, intelligence) today: Plato, the idealist who believed in innate knowledge and completely abstract forms, and Aristotle, the empiricist who codified logic and grounded knowledge in observation. For centuries, this duality, between the abstract and the concrete, the a priori and the empirical, the ideal and the observed, has defined the contours of science, mathematics, and epistemology.
Entire generations of prominent philosophers and mathematicians have tried to resolve this tension, hoping either to unify these worldviews or to prove one correct over the other. None have succeeded.
René Descartes leaned toward a Platonic rationalism, trusting reason more than the senses. David Hume, by contrast, questioned causality itself and pushed empiricism to its skeptical extreme. Immanuel Kant attempted a monumental synthesis: he argued that while knowledge begins with experience (an Aristotelian notion), the mind imposes innate structures upon it, echoing Plato’s forms in a modern cognitive framework.
In mathematics, the divide takes on an even sharper edge. Kurt Gödel, a modern-day Platonist and self-described “philosophical realist”, believed in an objective mathematical reality that exists independently of human minds: truths waiting to be discovered, not invented. His incompleteness theorems, however, showed that no formal (read: Aristotelian) system of logic could capture all truths about arithmetic.
Bertrand Russell and Alfred North Whitehead, in their Principia Mathematica, tried to reduce all of mathematics to logical axioms, only to be blindsided by Gödel's results. David Hilbert, one of the greatest formalists who ever lived, dreamed of a complete and consistent foundation for mathematics, a dream Gödel brutally shattered.
Mind, Machines, and the Limits of Computation
Even in the 20th century, figures like Wittgenstein grappled with the implications of language, logic, and meaning, shifting from early idealism to later pragmatism. In more recent times, prominent figures like Roger Penrose have argued that human understanding (particularly mathematical insight) transcends computation, suggesting that there is something non-algorithmic, perhaps irreducibly Platonic, about our minds, and that consciousness cannot be reduced to mere computation.
In The Emperor’s New Mind, he famously claimed that “human understanding is not algorithmic,” pointing to Gödel’s incompleteness theorems as evidence that we can intuitively grasp truths that no formal system can prove. For Penrose, this hinted at something deeper, an intrinsic gap between human reasoning and what any Turing machine could simulate.
He later went even further. In Shadows of the Mind, Penrose proposed that consciousness might arise from quantum processes inside the brain, particularly within neuronal microtubules. With anesthesiologist Stuart Hameroff, Penrose developed the Orchestrated Objective Reduction (Orch-OR) theory: “conscious moments are the result of quantum state reductions, orchestrated within the brain’s microtubular structure.”
In other words, he sees consciousness, and perhaps even the root of human reasoning, as deeply dependent on non-computable physics: something fundamentally outside the domain of current AI models.
However, the theory has faced significant criticism from both neuroscientists and AI researchers. Among the most vocal critics is Ray Kurzweil, futurist and author of The Singularity Is Near. Kurzweil dismisses Orch-OR as speculative and unsupported by neuroscience or physics. He argues that:
“There’s no evidence that quantum effects play a significant role in the operations of the brain. Neurons are too warm and noisy for coherent quantum states to be maintained.”
Kurzweil maintains that consciousness and reasoning are emergent phenomena from complex classical computations—nothing more mysterious than what can eventually be simulated by machines. He views Penrose’s appeal to quantum mechanics as an unnecessary detour, writing that:
“Penrose is looking for magic where none is needed. Consciousness is not a quantum phenomenon—it’s a software phenomenon.”
Kurzweil also critiques Penrose’s use of Gödel’s incompleteness theorems, claiming that humans aren’t exempt from formal limitations and often arrive at incorrect conclusions, hardly proof of non-computable insight.
While deeply controversial and unproven, Penrose’s view remains one of the most radical and enduring challenges to the idea that intelligence can be fully captured by algorithms. As he put it:
“There must be something in human consciousness that escapes algorithmic description, something that physics has yet to uncover.”
Despite these titanic efforts, no one has succeeded in definitively settling the question. Is knowledge discovered or constructed? Is the mind a mirror to an abstract realm of truths, or a machine that shapes perception through evolved heuristics? This unresolved tension between Platonic and Aristotelian views is not just a philosophical quibble; it is one of the deepest open questions shaping how we understand reality itself.
Today, it reemerges with new urgency in the age of AI.
The Age of “Conscious” Machines
As we are trying to build machines that mimic thinking, we are, once again, forced to confront ancient questions. Are these machines just pattern recognizers—Aristotelian in spirit, deriving rules from data? Or can they grasp something closer to Platonic truth, reasoning abstractly, correctly, and universally?
This debate isn’t just academic. It cuts to the heart of what artificial intelligence is capable of. And as we’ll see, even our most advanced models still fall short of resolving it.
In more modern terms, their debate lives on in the tension between deductive reasoning (drawing conclusions from known truths) and inductive reasoning (inferring general principles from patterns in experience).
Artificial Intelligence, surprisingly or not, has inherited this philosophical divide.
In 2024 and 2025, a new wave of AI models began to reshape what we mean by “intelligent machines.” It all started with OpenAI’s o1, a model trained to “think out loud,” emulating the step-by-step deliberation that Aristotle might have admired.
Then, all of a sudden, came DeepSeek’s R1, a Chinese-developed model that stunned researchers with its performance on logical and mathematical tasks. Now, we’re watching a full-on reasoning arms race: Google, Meta, OpenAI (o3), and others are scrambling to roll out their own “reasoning-first” systems.
These aren’t just faster or bigger models. They represent a qualitative shift: a move toward machines that seem capable of structured thought, not just pattern prediction. They perform far better on tasks requiring planning, logic, and multiple steps of inference.
To achieve this, they had to sacrifice one thing: speed. Many now use extensive internal computation during inference, sometimes taking minutes to generate a single answer. But the trade-off seems worthwhile: fewer hallucinations, better argumentation, and improved consistency.
Yet, despite their apparent sophistication, these models fall short of both Aristotle’s logical rigor and Plato’s idealized Reason.
While reasoning-enhanced LLMs may seem closer to Plato’s philosopher-king, capable of grasping abstract truths, they remain fundamentally inductive systems. They predict the next word in a sequence based on statistical patterns in data, not on the application of logical inference rules. And though they often mimic deductive steps (especially when prompted to "think step by step"), their reasoning is brittle, unreliable, and prone to subtle contradictions that are often difficult to detect.
Formal reasoning (mathematics, symbolic logic, and provably correct argumentation) remains elusive.
This leads us to a key question: Are we on a straight path to flawless machine reasoning? Or are we seeing the limits of what language models can do—even with more data, compute, and clever prompting?
In this article, we argue that there are fundamental architectural limits to what large language models can achieve in terms of formal reasoning. These limits arise not from insufficient training, but from the inherent properties of their design: a fixed computational graph, a stochastic prediction mechanism, and the absence of an explicit symbolic reasoning engine.
Current LLMs seem able neither to reason from first principles in Aristotle’s sense, nor to grasp Plato’s ideal forms. At best, they simulate the appearance of reasoning; at worst, they produce confident nonsense.
To be clear: we are not saying that AGI is impossible. We actually believe that machines could eventually match or exceed human intelligence, and LLMs are likely a foundational piece of that puzzle. But we are pretty skeptical of the claim that simply scaling current architectures (more tokens, more GPUs, longer context) will get us there.
In the rest of this article, we’ll explore why large language models still struggle with formal reasoning, examine the most compelling counterarguments, and outline what a future architecture might require if we want machines to reason not just statistically, but correctly.
This piece is part of a broader series that spans the deep frontiers of machine intelligence—touching on physics, logic, advanced mathematics, computer architectures, economics, laws of scale, socioeconomic incentives, and enduring philosophical questions about the nature of reasoning and consciousness itself.
It’s a long read, but if you’ve ever wondered whether machines can truly think, this series is for you.
From Plato to GPT: The Long Arc of Reasoning
The evolution of reasoning—from its philosophical origins to its formalization in science and computation—offers a rich lens through which to assess the capabilities and limitations of artificial intelligence. At its core, reasoning has historically been divided into two main branches: deductive and inductive, each reflecting distinct epistemological commitments and different methods of inference.
Plato, writing in the fourth century BCE, emphasized rationalism, proposing that knowledge arises from immutable, abstract Forms apprehended through reason alone. For him, the physical world was just an imperfect copy, and logical deduction the only path humans had toward true knowledge.
Aristotle, his student for some twenty years, introduced a more grounded framework, rejecting Plato’s theory of Forms outright: he developed what is now called formal logic (most famously, the syllogism) and championed empirical observation as a necessary component of knowledge formation. This marked the origin of the empiricist tradition, where reasoning is built on generalizations from experience: what we now call induction.
These foundational ideas would be refined and systematized over two millennia.
The Rise and Fall of Rationalist Certainty
During the Scientific Revolution, thinkers like Francis Bacon formalized and promoted what was later called inductive reasoning as the basis for scientific inquiry, while René Descartes reasserted the primacy of deductive rationalism: starting with general principles or axioms and arriving at specific conclusions through logical steps and methodic doubt, often by way of his analytic geometry.
The Enlightenment witnessed the peak of deterministic optimism in Pierre-Simon Laplace, whose vision of the universe—governed by immutable laws and entirely predictable given sufficient knowledge—epitomized deterministic reasoning. Laplace’s hypothetical Demon, capable of knowing all positions and velocities of particles, symbolized the dream of a perfectly knowable (and deterministic) universe governed by deductive causality. This vision found scientific expression in Newtonian mechanics, Euclidean geometry, and later in formal systems like Peano arithmetic and the basics of set theory.
However, the limits of these systems became evident in the 20th century. David Hume had already questioned the philosophical foundation of induction in the 18th century, arguing that no empirical regularity can logically guarantee future regularity—a critique that remains unresolved even today.
Then Immanuel Kant attempted, with partial success, to synthesize rationalism and empiricism through synthetic a priori judgments. He argued that while knowledge begins with experience, the mind actively shapes that experience using inherent cognitive structures like space, time, and categories such as causality. Unlike analytic judgments (e.g., "All bachelors are unmarried"), synthetic a priori judgments (e.g., "Every event has a cause") are not derived from experience but are necessary for understanding the world.
Kant’s approach aimed to preserve the certainty of mathematics and science, while acknowledging the role of experience. However, Kant also proposed that the ultimate nature of reality—the noumenal world—remains unknowable, suggesting limits to human reason that later philosophers and mathematicians, including Gödel, would further explore.
Kant’s framework simply filed such undecidable statements under the heading of “noumena”, leaving them without any formal treatment: a major gap that later logicians would have to confront.
Even Laplace’s Demon, the hypothetical intelligence that could predict the future given perfect knowledge of all particles' positions and velocities, was decisively refuted in the twentieth century by quantum physics and later discoveries, which revealed the universe to be probabilistic rather than deterministic.
The Heisenberg Uncertainty Principle forbids simultaneous exact knowledge of position and momentum, while quantum indeterminacy introduces true randomness into physical processes, making outcomes probabilistic rather than deterministic. Even attempts to observe a system alter its state, meaning that complete prediction is not just impractical, it’s physically impossible.
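In its standard form, the principle bounds the product of the position and momentum uncertainties from below:

$$
\Delta x \, \Delta p \;\ge\; \frac{\hbar}{2}
$$

No measurement, however clever, can shrink both uncertainties below this bound; the limit is a property of nature, not of our instruments.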
This part is pretty important, so we will discuss it in its dedicated chapter below.
Early Concepts of Undecidability
Long before Gödel, thinkers glimpsed that some truths might lie beyond formal proof. In geometry, for example, 19th‐century mathematicians discovered that Euclid’s fifth postulate (the Parallel Postulate) could never be derived from the other axioms.
Lobachevsky, Bolyai, and even Gauss suspected as much, and Beltrami (1868) finally proved that the parallel postulate is independent of Euclid’s other axioms. In philosophy, Kant had argued in the Critique of Pure Reason (1781) that geometric truths are not “analytic” deductions from definitions but require pure intuition – geometry, on this view, is “undecidable by analytic means”.
In effect, Kant saw Euclidean geometry as a synthetic a priori system whose theorems cannot be derived solely by logic. These early insights, geometry requiring extra-intuitive assumptions, anticipate the modern idea of a true statement not provable from given axioms.
By the early 20th century, the stage was set for formalizing such ideas. Hilbert’s formalist program assumed arithmetic could be made complete, but some of his colleagues already suspected otherwise. In 1928 Paul Bernays and Alfred Tarski explicitly discussed the possibility that set theory might be incomplete (undecidable) and that some propositions could escape proof.
Even computationalists like John von Neumann, directly influenced by Gödel’s work, conceded that logic and mathematics might not be fully decidable, while Brouwer (1928) held that mathematics is “inexhaustible” and cannot be completely formalized. Thus by the late 1920s the notion of incompleteness – true statements beyond formal derivation – was circulating as a suspicion, but no precise theorem yet existed.
Gödel’s Formalization of Undecidable Sentences
Kurt Gödel brought full rigor to these hints in 1931. He showed that in any consistent formal system (powerful enough to encode basic arithmetic), one can explicitly construct a statement G that essentially says “G is not provable in this system.” By assigning numbers to symbols (Gödel numbering) and applying a diagonalization (fixed-point) lemma, Gödel built this self-referential sentence.
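To make the encoding step concrete, here is a toy sketch of Gödel numbering in Python, using the classic prime-exponent scheme; the six-symbol alphabet is our own invention for illustration, not Gödel’s original one.

```python
# Toy Gödel numbering: give each symbol of a tiny formal alphabet a code,
# then encode the string s1 s2 ... sn as 2^c1 * 3^c2 * 5^c3 * ...
# Unique prime factorization guarantees the encoding can be inverted.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # toy limit: 10-symbol formulas
SYMBOLS = {"0": 1, "S": 2, "=": 3, "(": 4, ")": 5, "+": 6}  # illustrative alphabet

def godel_number(formula: str) -> int:
    n = 1
    for prime, symbol in zip(PRIMES, formula):
        n *= prime ** SYMBOLS[symbol]
    return n

def decode(n: int) -> str:
    inverse = {code: sym for sym, code in SYMBOLS.items()}
    out = []
    for prime in PRIMES:
        if n == 1:
            break
        exponent = 0
        while n % prime == 0:
            n //= prime
            exponent += 1
        out.append(inverse[exponent])
    return "".join(out)

assert decode(godel_number("S0=S0")) == "S0=S0"
print(godel_number("S0=S0"))  # one (large) natural number encoding the formula
```

Gödel’s actual construction goes much further, arithmetizing the provability relation itself, so that “G is not provable” becomes a statement about numbers.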
His First Incompleteness Theorem then proves that if the system is consistent, neither G nor its negation can be proved within the system. Equivalently, G is true (in the intended model of arithmetic) but unprovable by the axioms.
In fact, as Wikipedia beautifully summarizes: “for any such consistent formal system, there will always be statements about natural numbers that are true, but that are unprovable within the system”. Gödel’s proof is constructive: it exhibits a particular undecidable sentence (the Gödel sentence for the system).
Gödel’s Second Incompleteness Theorem goes further: it shows that no such system can prove its own consistency. In modern terms, the statement “System P is consistent” cannot be proved inside P itself (assuming P really is consistent). In Gödel’s own words (as translated), “the consistency of P, if P is consistent, cannot be established by a proof within P”.
This was a shock to Hilbert’s program: one cannot complete an axiomatic theory by proving its consistency from within. In both theorems Gödel eliminated any appeal to vague notions of truth, instead using purely syntactic notions like “provable formula” and requiring only the (weak) assumption of consistency of the system.
Gödel vs. Earlier Proposals
Was Gödel the first to define “true but unprovable” rigorously?
In spirit, hints had appeared earlier, but Gödel was the first to produce a precise mathematical demonstration. As noted, Bernays and Tarski (1928) discussed incompleteness abstractly, and von Neumann raised the possibility that the Entscheidungsproblem (the decision problem) might be unsolvable.
Brouwer even argued informally that no formal system can exhaust all mathematical truth. But these were expectations or intuitions, not formal results.
Gödel’s work introduced concrete formal machinery, like the arithmetization of syntax, a provability predicate, and the diagonal lemma, to produce an undecidable statement. In that sense Gödel invented the mathematically rigorous notion of an undecidable (or true‐but‐unprovable) sentence.
Later logicians (Tarski, Church, Turing) showed other limits (undefinability of truth, unsolvability of the halting problem, etc.), but they followed Gödel’s lead. No sufficiently expressive formal system is truly complete – a fact Gödel proved for the first time.
Kant’s Noumenal Limits vs. Gödel’s Undecidability
Kant’s notion of the noumenon is philosophically analogous but not totally identical to Gödel’s limit. Kant distinguished phenomena (the world as we perceive it through space, time, and categories of thought) from noumena (things-in-themselves, independent of our perception).
He held that we can never have knowledge of noumena – they are in principle unknowable. As Kant put it,
“the objects we intuit in space and time are appearances, not objects that exist independently of our intuition… We can only cognize objects in space and time, appearances. We cannot cognize things in themselves”.
In other words, human reason has an epistemic boundary: there may be truths about reality that lie forever beyond our cognitive grasp.
Gödel’s theorems highlight a somewhat similar boundary for mathematics: in any given formal system, there are arithmetic truths that the system cannot reach.
In one analogy, the formal axioms and rules of a system are like Kant’s “forms of intuition and categories,” which shape what we can know. Gödel’s undecidable sentences are like mathematical “noumena” in that they are not accessible from inside the system.
However, unlike Kant’s noumena, Gödel’s sentences are concretely defined and can often be recognized as true by reasoning outside the system (for example, by assuming the system’s consistency and carrying out the meta-proof).
They are not metaphysically beyond all knowledge – they simply lie beyond that system’s deductive power. In fact, in a stronger system or with additional axioms, one of these sentences might become provable.
Thus, Gödel’s undecidable propositions are not noumena in the full Kantian sense (we can “see” them by metamathematical reflection), but they serve as a mathematical mirror to Kant’s epistemic limits. Both ideas underscore that our methods of knowing leave some truths out of reach.
In sum, Kant anticipated the theme of inherent limits to reason, and Gödel translated that theme into a precise mathematical context.
The Philosophical and Cognitive Divide
Returning to our timeline (or jumping ahead, if you like), we encounter Karl Popper and later Thomas Kuhn, who restructured the philosophy of science around falsifiability and paradigm shifts. Popper introduced the concept of falsifiability as the demarcation criterion for scientific theories, arguing that for a theory to be considered scientific, it must be testable and, in principle, refutable.
This perspective shifted the focus of scientific reasoning away from strict deductive logic and toward a probabilistic and provisional approach, where theories are continuously subjected to testing and revision.
Thomas Kuhn, building on Popper’s ideas, argued that scientific progress doesn’t merely involve the gradual accumulation of knowledge. Instead, it occurs through paradigm shifts: fundamental, sometimes revolutionary changes in the frameworks that guide scientific thought, occasionally triggered by accident. These shifts mark moments when a dominant theory or worldview is replaced by a radically different one.
Kuhn's work further reinforced the idea that scientific reasoning is not purely objective or deductive, but inherently influenced by the paradigms and theories scientists hold, making it probabilistic, provisional, and theory-laden.
In cognitive science and neuroscience, this divide finds echoes in the dual-process theory of reasoning, hugely popularized by Nobel laureate Daniel Kahneman, which distinguishes fast, intuitive, inductive “System 1” processes from slow, deliberate, deductive “System 2” thinking.
The brain’s neocortex, central to human reasoning, appears to implement hierarchical Bayesian inference, constantly updating probabilistic models of the world in a fundamentally inductive architecture. This mechanism allows us to generalize from experience, make predictions under uncertainty, and adaptively revise our beliefs in light of new evidence.
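As a toy illustration of that updating loop, here is a textbook Beta-Bernoulli conjugate update in Python; it is a sketch of the inductive mechanism, not a model of cortical circuitry.

```python
# Bayesian belief updating: a Beta(a, b) prior over the probability of an
# event, revised after each observation. Confidence tracks frequencies;
# nothing is ever proved, only made more or less plausible.

def update(a: float, b: float, observed: bool) -> tuple[float, float]:
    # Conjugate update: a success increments a, a failure increments b.
    return (a + 1.0, b) if observed else (a, b + 1.0)

a, b = 1.0, 1.0  # Beta(1, 1): a uniform prior, i.e., total ignorance
for obs in [True, True, True, False, True]:
    a, b = update(a, b, obs)
    print(f"posterior mean P(event) = {a / (a + b):.3f}")
```

Note what is missing: at no point does the agent derive anything from axioms; its confidence simply tracks the evidence, exactly the kind of inference Hume warned can never yield certainty.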
However, there is also a crucial deductive component embedded in human cognition. Our brains are capable of constructing internal symbolic representations, manipulating abstract rules, and following structured chains of logical inference, especially in tasks involving mathematics, language syntax, planning, and problem-solving.
This deductive capacity is typically associated with “System 2” in dual-process theories of cognition: slow, effortful, and rule-based reasoning. It underpins our ability to engage in explicit argumentation, perform mental simulations, and evaluate counterfactuals.
While inductive inference gives us flexibility and adaptability, deduction provides rigor, coherence, and truth-preserving mechanisms. Human reasoning, therefore, is not merely probabilistic; it is hybrid—leveraging both fast, pattern-based generalizations and slow, structured logical computation depending on context and cognitive demand.
Karl Friston’s Free Energy Principle pushes this picture further, providing a unified account of perception and action as the optimization of probabilistic beliefs under sensory uncertainty.
Against this backdrop, contemporary AI systems, especially large language models, inherit an inductive lineage. They do not perform deduction in the formal sense; rather, they approximate it through statistical learning.
These models generalize from vast corpora of text, learning to predict patterns that often resemble logical reasoning but lack any truth-preserving mechanism. There is no embedded formal system, no syntactic proof verification, no internal concept of axioms or semantic entailment.
Unlike a theorem prover or proof assistant (such as Lean, Coq, or Isabelle), which encode formal systems and derive truths through mechanically verifiable steps, LLMs operate in the realm of plausibility, not necessity. They may generate text that appears deductively valid, but they do so without understanding or guaranteeing logical coherence.
Thus, the philosophical tension between deduction and induction not only shaped centuries of human thought—it now defines the boundary between human and artificial reasoners. Until AI systems internalize the rigor of Gödel’s formalism, Aristotle’s logic, and the structured determinism envisioned by Laplace, they remain inductive engines: powerful, predictive, and compelling, but ultimately syntactic mimics of a logic they cannot yet embody.
Do LLMs Reason or Predict?
The evolution of our philosophical understanding about reasoning, from Platonic ideals and Aristotelian logic to Gödel’s formal limits and Friston’s probabilistic brain, provides not just a chronology of intellectual development, but a framework to interrogate the nature of machine intelligence today.
Reasoning, in its many guises, has always been the scaffolding of knowledge: an attempt to move from what is known to what can be justified. In philosophy, reasoning was the means to understand truth; in science, the tool to uncover laws of nature; in computation, the actual engine of inference. As artificial intelligence enters center stage, it inherits this legacy, but also its fractures. When we say that Large Language Models (LLMs) “reason,” we must first ask: in which sense of the word? Are we witnessing genuine inference, or a statistical echo of centuries of structured thought?
Reasoning: A Loaded and Layered Term
“Reasoning” is a deceptively simple word. In everyday language, it can mean justifying a decision (“I didn’t go out because it was raining”), solving a riddle, or understanding someone’s emotions by simulating their point of view. In philosophy, reasoning touches on the core of epistemology: what we know and how we justify it.
In psychology and cognitive science, it spans unconscious heuristics (like availability bias) to slow, deliberative problem-solving, as modeled by Daniel Kahneman’s System 1 and System 2.
In computer science, “reasoning” narrows further to mean syntactic manipulation within formal systems (symbolic logic, theorem proving, rule-based inference), where every step is transparent, traceable, and truth-preserving.
In short, every discipline operationalizes reasoning differently. This creates ambiguity when we ask whether large language models “reason.” Do they simulate it, perform it, or merely appear to? Without grounding the term, we risk confusing metaphorical reasoning (“it sounds logical”) with formal reasoning (“it adheres to rules of inference”).
Consider the difference between ChatGPT completing a syllogism like “All humans are mortal. Socrates is human. Therefore, Socrates is mortal” (which it can do), versus proving a theorem in Peano Arithmetic, where each step must be justified within a formal axiomatic system.
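The contrast is easiest to see in code. Here is the same syllogism as a machine-checked proof in Lean 4 (the names, like Person and socrates_mortal, are ours, chosen for illustration): the conclusion is not a plausible continuation but a term the type-checker must accept, and a single misstep makes the proof fail to compile.

```lean
-- The classic syllogism, verified rather than merely stated.
theorem socrates_mortal {Person : Type} (Human Mortal : Person → Prop)
    (all_humans_mortal : ∀ p, Human p → Mortal p)        -- all humans are mortal
    (socrates : Person) (socrates_human : Human socrates) -- Socrates is human
    : Mortal socrates :=                                  -- therefore, Socrates is mortal
  all_humans_mortal socrates socrates_human
```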
So let us be precise: in this article, when we say LLMs cannot reason, we are referring to formal deductive reasoning, the kind that underpins mathematics, logic, and rigorous scientific modeling. This is the type of reasoning where proofs are verified step by step, and logical soundness is guaranteed. As philosopher Wilfrid Sellars put it:
“the aim of philosophy is to understand how things in the broadest sense of the term hang together in the broadest sense of the term”
But formal logic is the method by which we prove that they do.
LLMs may generate plausible conclusions, but they do not derive them from other core principles. They simulate the surface of logic, not its structure.
A Narrow but Rigorous Definition of Reasoning
To evaluate whether LLMs can reason, we must adopt a definition of reasoning that is precise, computationally verifiable, and epistemically robust. That definition should be something like this:
Formal reasoning is the ability to derive logically sound conclusions from explicitly defined premises by applying valid rules of inference.
This kind of reasoning underpins mathematics (e.g., deriving theorems from axioms), computer science (e.g., verifying programs or circuits), and various parts of philosophy (e.g., evaluating the validity of arguments). Within formal reasoning, deduction is the gold standard: from true premises and valid inference rules, it guarantees true conclusions. There is no statistical approximation here, only certainty.
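The canonical inference rule is modus ponens, traditionally written with the premises above the line and the conclusion below it:

$$
\frac{P \rightarrow Q \qquad P}{Q}
$$

If both premises hold, the conclusion cannot fail: the rule preserves truth by its form alone, regardless of what P and Q happen to mean.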
Induction, by contrast, is inherently uncertain. It involves generalizing from observations: “every swan I’ve seen is white, therefore all swans are white.” It is central to machine learning, scientific discovery, and human pattern recognition, but as Hume showed, no amount of inductive evidence guarantees truth.
This philosophical vulnerability is vividly illustrated by the inductivist turkey, a thought experiment popularized by Nassim Taleb: a turkey is fed every day and comes to believe (based on repeated observation) that it will always be fed.
But on the 1,001st day, just before Thanksgiving, it is slaughtered. The turkey’s conclusion was reasonable by inductive standards, yet catastrophically wrong. The lesson is clear: induction can only suggest likelihood, never certainty, and it becomes most dangerous when it seems most reliable.
What LLMs Actually Do
Large Language Models like GPT-4 or Claude are, at their core, massive statistical machines. They are trained to predict the next token in a sequence given a context—nothing more, nothing less. That’s all, folks. All the apparent intelligence—composing essays, writing code, answering questions, solving math problems—emerges from this deceptively simple mechanism of repetition and pattern completion.
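To see how spare that mechanism really is, here is a bigram language model in miniature, written in Python over a toy corpus of our own; a transformer learns vastly richer statistics over vastly more text, but the output rule (sample what is likely) is the same.

```python
import random
from collections import Counter, defaultdict

# Next-token prediction in miniature: estimate P(next | previous) from a
# corpus, then sample. Plausibility is frequency; logic never enters.
corpus = ("all humans are mortal . socrates is human . "
          "therefore socrates is mortal .").split()

counts: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token(prev: str) -> str:
    options = counts[prev]
    # Sample in proportion to observed frequency: the most "plausible"
    # continuation, with no check that it follows from anything.
    return random.choices(list(options), weights=list(options.values()))[0]

token = "socrates"
for _ in range(3):
    print(token, end=" ")
    token = next_token(token)
print(token)  # e.g. "socrates is mortal ." (or, just as happily, "human")
```

Everything an LLM does is this loop at scale: a learned distribution over continuations, sampled one token at a time.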
Importantly, LLMs do not internally validate arguments, check proofs, or apply formal rules of logic. They do not encode axioms, nor do they operate through deductive inference systems like a theorem prover. When an LLM generates a proof, solves a riddle, or completes a logical syllogism, it is not deducing a conclusion from premises: it is statistically guessing what a plausible continuation of the sequence might look like, based on its training data and the context it is given.
The output might look like logic, but it is generated through inductive pattern matching, not deductive reasoning. And there is no internal mechanism to verify that the conclusion actually follows from the premises. Sometimes it does; often it doesn’t. Crucially, the model itself cannot tell the difference, at least not from its own “perspective”.
But here’s the twist: neither can we, most of the time.
Humans, too, are primarily inductive reasoners, largely for evolutionary reasons. We generalize from patterns in the world, not from axiomatic foundations. If we truly operated like formal logic machines, we would constantly freeze in a kind of deductive paralysis: the calculations required to be nearly certain of any everyday event are immense, and in many cases the relevant variables cannot even be enumerated, let alone computed.
Imagine if we refused to get out of bed because there is no logical proof the sun will rise tomorrow. Or if we refused to eat because we couldn't deductively prove the food wasn’t poisoned. Instead, we rely on inductive assumptions drawn from experience: the sun has always risen; the food looks and smells fine; cars usually stop at red lights.
These patterns form the basis of our confidence in everyday reasoning—not deductive certainty, but probabilistic expectation. This allows us to function in a world of uncertainty, but also leaves us more vulnerable to surprise—like the inductivist turkey.
So while LLMs differ from us in many ways, their reasoning (at least as statistical approximators) is not entirely alien to us. In fact, it echoes the same inductive shortcuts we evolved to survive. What separates us, for now, is our ability to reflect on reasoning itself—to build formal systems, question our inferences, and occasionally reach for deduction in the face of uncertainty.
Our capacity for self-reflection plays a pivotal role in human cognition, particularly in critical situations. This metacognitive ability allows us to evaluate our thoughts and actions, leading to the development of critical thinking. John Dewey, a prominent philosopher and educator, introduced the term "critical thinking" in his 1910 work How We Think, describing it as an "active, persistent, and careful consideration of any belief or supposed form of knowledge in the light of the grounds that support it".
Studies have shown that reflective (deductive) thinking typically has a longer latency period compared to intuitive (inductive) responses. For instance, in a large-scale experiment involving 3,667 participants, various techniques were used to activate reflective thinking, revealing that such processes are generally slower but lead to more accurate outcomes.
Why Deductive Reasoning Matters
At this point, you might then ask: why should we care whether LLMs can really reason, as long as they appear to do so?
The answer is reliability and epistemic accountability. In high-stakes contexts (mathematics, scientific discovery, safety-critical systems, legal arguments, medical research, and so on) we need solid, demonstrable guarantees.
We have to know that a conclusion is correct because it follows from sound logic, not because it looks plausible. This is why we use proof assistants like Lean, Coq, or Isabelle in formal verification. They enforce deductive rigor.
LLMs, by contrast, are optimized for fluency and plausibility, and not for truth or validity. They are excellent tools for exploration, hypothesis generation, and summarization. But they lack the machinery of formal reasoning.
That’s not a criticism at all: it’s a design feature. But it does limit their use in domains where formal reasoning is non-negotiable.
Common Objection: “But Humans Aren’t Perfect Reasoners Either!”
This is a popular rejoinder:
“Sure, LLMs can’t reason, but neither can humans, so what’s the big deal?”
The major flaw here lies in conflating fallibility with incapacity. Yes, humans are fallible reasoners. We are prone to cognitive biases, logical fallacies, and emotional interference. But the human brain can perform formal reasoning when trained, just as it can do calculus or play the piano. Mathematicians can prove complex theorems. Logicians can verify arguments. Even college students learn how to deduce conclusions from premises in symbolic logic courses.
More importantly, this line of argument is a familiar one, but it doesn’t hold up. As a species, we don’t excuse technological limitations by pointing to human ones, and rightly so. We don’t say, “This self-driving car crashed, but humans crash too, so it’s fine.” We evaluate technology based on whether it performs better than existing solutions in the relevant context.
In deductive logic, we already have tools that reason rigorously and reliably. So when LLMs are proposed as replacements or supplements in such domains, the comparison must be to these tools, not to the average human.
LLMs as Statistical Engines of Inference
At their core, Large Language Models are trained to minimize next-token prediction error across massive datasets, a process rooted in maximum likelihood estimation. In other words, they generate the most statistically probable continuation of a given input, based on patterns in their training data. This is not a minor detail: it defines everything they are capable of doing.
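In standard form, training minimizes the negative log-likelihood of each token given the tokens that precede it:

$$
\mathcal{L}(\theta) \;=\; -\sum_{t} \log p_{\theta}\!\left(x_t \mid x_{<t}\right)
$$

Nothing in this objective mentions truth, validity, or entailment; a fluent falsehood that matches the training distribution scores exactly as well as a fluent truth.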
Their outputs reflect learned correlations, not truths derived from axioms. They don’t encode logic rules, perform formal derivations, or verify proofs. When an LLM completes a logical argument or mathematical proof, it’s not following deductive steps: it’s assembling sequences of tokens that look like reasoning because similar patterns occurred frequently in the corpus.
In this sense, LLMs excel at inductive inference in the broadest, most statistical sense: drawing likely conclusions from incomplete or noisy data. They may simulate reasoning forms like modus ponens if those patterns are common, but they do not understand or apply inference rules. There’s no internal mechanism to ensure the conclusion follows from the premises—only a learned sense that such sequences often occur together.
This statistical strength makes LLMs incredibly powerful for tasks involving language fluency, analogy, and open-ended reasoning. They’re particularly adept at identifying patterns, summarizing content, generating creative continuations, and rephrasing inputs: all tasks that benefit from broad contextual generalization rather than strict logical structure.
But it also introduces brittleness. In tasks that require internal model checking, consistency enforcement, or multi-step logical planning, LLMs often falter unless scaffolded. Their probabilistic outputs are simply not built to maintain deductive invariants over multiple steps.
Bridging the Gap: Hybrid Systems
This topic raises a natural question:
can we give LLMs genuine reasoning capabilities by augmenting them?
Yes, and that’s exactly where the field is heading. Hybrid systems combine LLMs with symbolic tools like theorem provers, SAT solvers, formal logic engines, or code interpreters.
In these architectures, the LLM serves as an interface: translating user queries, interpreting output, chaining steps, and selecting the appropriate tool. But the actual reasoning is delegated to the symbolic component.
This hybrid approach is promising. It enables the natural language fluency and heuristic power of LLMs to interface with systems capable of rigorous deduction and formal validation. Think of it as a division of labor: LLMs handle semantics and probabilistic generalization; symbolic engines handle structure and logical soundness.
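Here is a minimal sketch of the pattern in Python, using the Z3 SMT solver as the symbolic component; the “LLM output” is hardcoded for illustration, where a real system would obtain it from a model call.

```python
# Hybrid pattern: the LLM proposes, the symbolic engine verifies.
# Requires: pip install z3-solver
from z3 import And, Bool, Implies, Not, Solver, unsat

human, mortal = Bool("human"), Bool("mortal")

premises = And(Implies(human, mortal),  # all humans are mortal
               human)                   # Socrates is human
conclusion = mortal                     # candidate answer "proposed by the LLM"

# Entailment check: the premises entail the conclusion exactly when
# premises AND NOT(conclusion) is unsatisfiable.
solver = Solver()
solver.add(premises, Not(conclusion))
print("valid" if solver.check() == unsat else "not entailed")
```

Only the final check carries epistemic weight; everything the model contributes upstream remains a suggestion until the solver signs off.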
But this also reaffirms the central point: LLMs on their own do not reason deductively. They mimic the surface structure of reasoning, but they do not internally implement it. If we want machines that reason as humans understand the term, we need to build systems that embed such reasoning, not just emulate it.
Conclusion
In this first exploration, we've traced the concept of reasoning across disciplines—philosophy, cognitive science, neuroscience, and artificial intelligence—and seen how Large Language Models fit into this broader landscape. We’ve clarified that while LLMs can produce outputs that resemble logical reasoning, their underlying mechanism is fundamentally statistical, not deductive. They do not apply inference rules, verify proofs, or operate within formal systems; they generate plausible continuations based on learned correlations in data.
This distinction matters. It helps us understand both the power and the limits of LLMs. They are remarkable at language generation, analogy, and probabilistic prediction, but fragile in tasks that demand rigor, consistency, or stepwise logic. As a species, we too reason inductively most of the time, but we’ve developed formal systems—mathematics, logic, critical thinking—to transcend the limitations of pure pattern recognition.
True reasoning (especially in the formal, verifiable sense) still lies beyond what LLMs can do alone. But hybrid approaches, where LLMs interface with symbolic engines, offer a compelling path forward.
These systems hint at a future where linguistic fluency and logical precision coexist, not in a single model, but in a modular architecture.
Ultimately, understanding the limits of artificial reasoning clarifies the strengths of human cognition, and the kinds of systems we must build if we want machines to not just imitate thought, but truly think.