The Shape of Learning: A Deep Dive into PyTorch Fundamentals
Understanding Tensors, Devices, and the Computational Grammar Beneath Every Model
Why I Went Back to PyTorch Fundamentals
We talk a lot about models. About software architectures. About the dizzying speed at which AI advances, especially over the last two and a half years, when it felt like the world flipped a switch.
Suddenly, everyone was talking about AI models. As we all remember, OpenAI released ChatGPT in late 2022, and within weeks, it became a household name.
Founders rebranded overnight. Investors rewrote their pitch decks. Product teams scrambled to integrate some flavor of generative AI, even if it barely fit.
Valuations soared. Startups with little more than a wrapper around a foundation model were raising at billion-dollar clips. Every VC newsletter sounded the same: “The AI wave is here.” Everyone wanted to catch it. Build on it. Bet on it.
And yet, beneath the noise, the hype cycles, the demo days, the billion-parameter bragging contests, there’s something a bit quieter.
Something more elemental. Something every model is built on, every loss propagated through, every decision shaped by: tensors.
Before you can reason about vision transformers, LLMs, retrieval-augmented generation, or distributed training pipelines, you need to ask:
What exactly is PyTorch doing under the hood?
What is a tensor, really?
How does GPU acceleration fit into the story?
And how do we make sense of the shapes, the dimensions, the silent architecture behind every computation?
But I didn’t begin here.
It started, as most things do these days, with a side project. I wanted to experiment with a simple neural network to classify some data I had scraped, a kind of "hello world" for machine learning.
I had dabbled in scikit-learn, tried a few Keras tutorials, but every time I hit something non-trivial, it felt like I was fighting against the framework instead of learning the underlying mechanics.
Then I found PyTorch.
At first, it felt surprisingly raw. It didn’t abstract away as much. I had to write the training loop. I had to think about shapes. I had to manage the device my data was on. But instead of feeling overwhelming, it felt… honest.
Exposed in a way that made the learning real. I wasn’t just pushing buttons; I was watching the gears turn.
And once I started working with tensors, really working with them, I realized that I’d been skipping a fundamental layer of understanding. I had learned the tools, but not the terrain.
Why This Post Exists
This post is, in a way, me writing the guide I wish I had back then. Not a tutorial. Not a "do this, then that." But a map of the conceptual ground floor: the PyTorch workflow, from first principles.
Because before you can make sense of dropout or cross-entropy loss, you need to understand the logic of tensor operations.
Before you can evaluate accuracy, you need to understand what it means to propagate data through a model. Before you can optimize anything, you need to ask: what is being optimized?
In other words: if you're serious about learning deep learning, you can't skip this. So this post isn’t about building the next GPT or fine-tuning the latest model.
It’s about going deeper into the core mechanics: tensors, operations, data loading, gradients, GPU acceleration, model evaluation. The grammar of PyTorch. The architecture of thought behind the framework.
It’s about becoming fluent in the layer most people skip.
And maybe, by the end of this journey, you’ll see what I saw: that understanding tensors isn’t just a technical necessity: it’s a lens. One that makes the rest of the field clearer, more grounded, and far more powerful.
Let’s get started.
What Is PyTorch, Really?
When you first hear about PyTorch, it’s often in the context of something else.
Somewhere around the globe, someone is building a transformer right now. Someone else is training diffusion models. A startup raises $20M to fine-tune LLMs on PDFs.
And somewhere in there, PyTorch gets mentioned (quietly, almost casually) as the framework behind it all.
But PyTorch isn’t just a tool for training models. It’s a language for thinking in computation. It’s how raw data becomes structure, how equations become code, how gradients flow backward and ideas move forward.
Developed by Facebook’s AI Research lab, PyTorch didn’t win adoption because it had the best marketing: it won because it felt different.
It wasn’t trying to hide the details. It was trying to teach you something.
Unlike its early competitors, PyTorch didn’t force you to predefine a static graph. Every forward pass creates a new graph, shaped by your data, your logic, your experiment.
You could write loops, conditionals, and even recursion inside your model: and it just worked. It wasn’t just powerful. It was honest. You could see the mechanics.
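Here’s a tiny sketch of what that freedom looks like; this toy model is purely illustrative, not taken from any particular codebase:

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, x):
        if x.sum() > 0:              # plain Python control flow inside the model
            x = self.layer(x)
        return torch.relu(x)

out = TinyModel()(torch.rand(2, 4))  # the graph is built fresh during this forward pass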
But even that’s getting ahead of the story.
Because before you can build anything in PyTorch, before you write your first layer or train your first model, you need to understand the thing everything is built from.
The Hidden Terrain Beneath Every Model
Tensors are PyTorch’s native language.
If you’re coming from NumPy, they’ll look familiar: just multidimensional arrays. But underneath that surface, they’re so much more. Tensors aren’t just data holders.
They carry gradients. They live on devices. They track operations for autograd. They behave like scalars, vectors, or entire batches of examples.
They are what every model sees. What every layer transforms. What every loss function measures. What every optimizer updates.
And if you want to understand PyTorch, really understand it, you need to spend time here. Not skimming. Not copy-pasting. Learning the terrain.
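Here’s a minimal sketch of what “carrying gradients” means in practice:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)  # this tensor tracks operations
y = (x ** 2).sum()                                # y = 2**2 + 3**2 = 13
y.backward()                                      # autograd computes dy/dx
print(x.grad)                                     # tensor([4., 6.]), i.e. 2 * x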
The Grammar of Deep Learning
Let’s begin with the fundamentals: tensor dimensions.
They may sound abstract, but they govern the entire logic of your model.
Dimensions define the structure of your data: how many axes it has, how values are arranged across those axes, and how it flows through operations like linear layers, convolutions, or matrix multiplications.
If your input tensor doesn't match the expected rank or shape, layers won’t connect, dot products will fail, and backpropagation may silently break.
So ask yourself: Do I understand the exact shape of my tensors? Can I follow their transformation: from raw input, through embedding layers, activations, reshapes, and finally into the output prediction?
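As a small illustration (the sizes here are arbitrary), watching a shape change through a single layer is often all it takes to start answering that:

import torch
import torch.nn as nn

x = torch.rand(32, 10)            # a batch of 32 examples, 10 features each
layer = nn.Linear(10, 4)          # maps 10 features to 4
out = layer(x)
print(x.shape, "->", out.shape)   # torch.Size([32, 10]) -> torch.Size([32, 4])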
Data From Thin Air
PyTorch makes it easy to create tensors:
torch.tensor([1, 2, 3]) # From raw values
torch.rand(2, 3) # Random values in [0, 1)
torch.zeros_like(other_tensor) # Match shape
torch.arange(0, 10, step=2) # Sequence of values
Each tensor carries metadata with it:
.shape: its size
.ndim: its rank
.dtype: its data type
.device: where it lives (CPU or GPU)
.requires_grad: whether it tracks gradients
These aren’t technical details. They’re levers you’ll use constantly. PyTorch doesn’t hide this from you; it invites you to manage it.
Ask: Is my tensor on the right device? Using the right precision? Tracking gradients when I need it to?
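One way to answer those questions is simply to print the metadata:

import torch

t = torch.rand(2, 3)
print(t.shape)          # torch.Size([2, 3])
print(t.ndim)           # 2
print(t.dtype)          # torch.float32
print(t.device)         # cpu (by default)
print(t.requires_grad)  # False (by default)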
Where Learning Happens
Everything you’ve heard about deep learning, the matrix multiplications, the non-linearities, the backpropagation, happens through tensor math.
a + b # Add
a - b # Subtract
a * b # Element-wise multiply
a / b # Divide
torch.matmul(a, b) # Matrix multiply
There’s elegance here. Simplicity. But it also hides complexity.
Take torch.matmul(): it’s just a function. But under the hood, it’s the operation every layer relies on. It transforms inputs. It computes attention scores. It turns raw data into embeddings.
Don’t let the syntax fool you. This is the learning.
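To make that concrete, here’s a rough sketch of what a linear layer boils down to (the weight and bias here are illustrative placeholders, not PyTorch internals):

import torch

x = torch.rand(32, 10)        # batch of 32 inputs, 10 features each
W = torch.rand(10, 4)         # weights: 10 inputs -> 4 outputs
b = torch.rand(4)             # one bias per output
out = torch.matmul(x, W) + b  # essentially what a linear layer computes
print(out.shape)              # torch.Size([32, 4])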
The Art of Tensor Manipulation
Sometimes, the math is fine, but the shape isn’t.
x.reshape(2, 3) # New shape (copies only if needed)
x.view(2, 3) # Shares memory
x.squeeze() # Removes 1-sized dimensions
x.unsqueeze(0) # Adds a new dimension at front
You’ll do this constantly.
CNNs want [batch, channels, height, width].
LSTMs expect [seq_len, batch, features].
Transformers want [batch, seq_len, embed_dim].
So you reshape. You unsqueeze. You flatten. Not because it’s pretty, but because the model expects a contract. And you need to respect it.
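For example, a single grayscale image usually needs explicit batch and channel dimensions before a CNN will accept it (the 28x28 size is just an example):

import torch

img = torch.rand(28, 28)  # a single 28x28 image
img = img.unsqueeze(0)    # add a channel dimension -> [1, 28, 28]
img = img.unsqueeze(0)    # add a batch dimension   -> [1, 1, 28, 28]
print(img.shape)          # torch.Size([1, 1, 28, 28]): [batch, channels, height, width]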
Indexing: How You Look Inside
Debugging models isn’t about error messages; it’s about peeking into tensors.
tensor[0] # First row
tensor[1, 2] # Specific element
tensor[:2, :2] # Top-left corner
This is how you ask: What does my model see right now?
Forget charts and dashboards. Sometimes all you need is to slice a tensor, print it, and stare. That’s how intuition forms.
Close Friends, Shared Memory
PyTorch plays nicely with NumPy:
torch.from_numpy(np_array)
tensor.numpy()
But be careful: these conversions share memory. Change one, and the other updates too. That can be super powerful… or disastrous.
Know what you’re sharing. Ask: Am I passing a view, or a copy?
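A minimal sketch of that sharing in action:

import numpy as np
import torch

np_array = np.ones(3)
t = torch.from_numpy(np_array)  # shares memory with np_array
np_array[0] = 99                # change the NumPy side...
print(t)                        # tensor([99., 1., 1.], dtype=torch.float64): the tensor sees it too

If you need an independent copy, torch.tensor(np_array) or tensor.clone() breaks the link.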
Randomness and Reproducibility
Every training run is a bit different. That’s the nature of initialization, dropout, and stochastic gradient descent.
But randomness can become a trap if you don’t control it:
torch.manual_seed(42)
torch.cuda.manual_seed(42)
If your model behaves differently every run, you won’t know whether you improved it, or got lucky.
Reproducibility isn’t an afterthought. It’s part of understanding.
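A quick way to convince yourself the seed is doing its job:

import torch

torch.manual_seed(42)
a = torch.rand(2, 2)
torch.manual_seed(42)
b = torch.rand(2, 2)
print(torch.equal(a, b))  # True: same seed, same "random" values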
GPU Acceleration: Speed, with Care
This is the part where PyTorch begins to feel like real power.
You write a single line:
device = "cuda" if torch.cuda.is_available() else "cpu"
tensor = tensor.to(device)
And just like that, your tensor disappears from CPU memory and reappears on a GPU. Thousands of cores now stand ready to parallelize your computation. A 20-minute operation can finish in seconds.
But here’s the thing: speed isn’t free.
The GPU is fast, but it’s also picky. Memory is limited. Transfers between CPU and GPU are expensive. And most GPU bugs are subtle and quiet: your model might not crash, but it also might not learn.
Everything must live on the same device. The model, the inputs, the targets, the loss computation. If even one part stays behind on the CPU, you’ll get cryptic errors or, worse, silent slowdowns.
And there’s another catch: once your tensors are on the GPU, you can’t just print them or pass them to NumPy. You have to bring them back first. Detach, move to CPU, convert.
The lesson is simple but non-negotiable: don’t just hope for speed; manage it.
Know where your tensors are. Move them intentionally. Think in terms of memory, not just code.
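Here’s a minimal sketch of that discipline, with a placeholder model standing in for yours:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(10, 2).to(device)  # the model's parameters live on the device
x = torch.rand(32, 10).to(device)    # the inputs must live there too

out = model(x)                       # computed entirely on the device
result = out.detach().cpu().numpy()  # detach, move to CPU, then convert for NumPy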
Ask yourself:
Am I explicitly controlling the device for my data and model?
Am I aware of what’s living in CPU RAM and what’s in GPU VRAM?
Am I measuring the cost of transfers, or just hoping they’re fast enough?
PyTorch won’t hold your hand here. And that’s a good thing.
The Tradeoff Between Precision and Speed
Once you’re using the GPU, another layer of complexity appears: tensor precision.
Most of the time, you’ll be working with 32-bit floats: float32. It’s the default. It strikes a good balance: fast, accurate, widely supported.
But as your models grow (or your hardware gets way more powerful), you’ll start to hear whispers of something else: float16.
Half precision.
It’s faster. It uses half the memory. It allows you to fit larger models on the same GPU. It's what powers much of the speed in modern transformer training pipelines.
But it’s not magic. It’s risky.
With float16, you’re working closer to the edge of what your GPU can represent. Gradients might vanish. Losses might explode. Layers that work perfectly in full precision might become unstable.
Debugging becomes far trickier. A model that “just works” in float32 might require careful attention in float16.
And there’s also float64, double precision, but it’s mostly for scientific computing, not deep learning. It’s slower. Way heavier. Not often needed unless you’re doing sensitive numerical work.
The deeper point is this: you have to choose.
Every tensor has a data type. And every data type is a compromise between speed, memory, and numerical stability. PyTorch gives you the ability to control it, but doesn’t make the choice for you.
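As a sketch of what exercising that control can look like (the autocast block below assumes a CUDA device is available):

import torch

a = torch.rand(2, 2)     # float32 by default
b = a.to(torch.float16)  # explicit cast to half precision
print(a.dtype, b.dtype)  # torch.float32 torch.float16

# Mixed precision: run selected ops in float16 while keeping float32 where it matters
with torch.autocast(device_type="cuda", dtype=torch.float16):
    x = torch.rand(32, 10, device="cuda")
    w = torch.rand(10, 4, device="cuda")
    y = torch.matmul(x, w)  # runs in float16 inside the autocast context
print(y.dtype)              # torch.float16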
So don’t just accept the defaults. Ask yourself:
Is float32 precise enough for my task?
Will float16 introduce instability?
Is my hardware optimized for mixed-precision training?
Am I measuring, or just guessing?
The answers matter more than you think. Especially when you're tuning the last few percent of a model's accuracy, or trying to fit one more layer into memory.
Precision. Devices. Shapes. Gradients. Ready?
By now, you’ve essentially touched every atomic layer of PyTorch.
You’ve seen that tensors aren’t just containers of data: they’re living systems. They carry gradients. They know where they live. They transform. They multiply, contract, expand, and broadcast across dimensions.
They’re the language deep learning speaks, and the only way to truly understand your model is to understand them.
You’ve also seen that PyTorch doesn’t abstract these things away. It asks you to be precise. To manage your own shapes. To know where your data lives. To trace each computation from input to loss.
And that’s a feature, not a bug.
If you’ve been through the struggle of debugging a model where nothing seems to work (no gradients flow, losses are nan, accuracy is frozen), you know how valuable that level of control becomes.
This is why PyTorch isn’t just a framework. It’s a mindset. A system of thinking.
Coming Up: The PyTorch Project Journey
Now that we’ve walked the foundations, it’s time to start building.
The next step is the real thing: an actual PyTorch project. Not just snippets of code, but a structured, end-to-end training pipeline.
We’ll walk through how to:
Load and preprocess data in a way that scales beyond toy examples.
Structure your model class using nn.Module, and why that abstraction matters.
Write a training loop that doesn’t just work once, but works well, every time.
Track loss and accuracy, across epochs, batches, and checkpoints.
Visualize results, decision boundaries, and training performance.
Save and load models, in ways that make deployment and iteration smooth.
We’ll use classification as the anchoring task. Something real, something visual, something you could build into a product or demo.
But underneath all of it, we’ll keep asking the same questions:
What are my tensors doing?
Where are they?
How are they shaped?
What flows forward, and what comes back through the gradients?
Because once you understand that, you can build anything.
And once you master the flow of data, computation becomes a form of thought.
Let’s move forward. Let’s build.
Let’s go deeper.