Apache Hudi and the Future of Data
How Uber Designed Apache Hudi for the Future of Real-Time Analytics
There’s a difference between solving a problem and building infrastructure to make the problem obsolete.
Introduction
In 2015, Uber found itself staring into the abyss of a familiar but invisible enemy: not bad code, not downtime, not broken dashboards. But data lag.
This wasn’t just a performance bottleneck. It was a systemic risk: one that grew in lockstep with Uber’s global ambitions.
The company was no longer just matching riders and drivers in San Francisco. By 2015, Uber was operating in over 60 countries.
Pricing models had evolved from flat rates to real-time surge algorithms. Fraud detection systems had to analyze behavioral patterns on a per-ride basis.
Estimated times of arrival (ETAs) were being recalculated mid-journey based on changing traffic, weather, and routing intelligence. Internal teams wanted to experiment, iterate, and analyze with the same immediacy that the app promised its users.
But while the business demanded real-time intelligence, the data backend was still stuck in yesterday.
Uber’s data systems were pumping in hundreds of terabytes per day: trip telemetry, GPS coordinates, payment logs, messaging metadata, dynamic pricing factors, and customer support traces.
But most of this data had to wade through Vertica, Hive, HDFS, and monolithic batch pipelines before it became usable.
And batch, at Uber scale, had come to mean one thing: latency.
Latency in ingestion. Latency in transformation. Latency in access. Every data scientist or analyst who asked a question was effectively asking for results that would arrive, if lucky, by the next day.
But it wasn’t just about speed. The data itself was alive: it mutated constantly. Drivers corrected past rides. Customers disputed payments. Devices uploaded logs asynchronously. Events arrived late, misordered, corrupted, or duplicated.
To keep the analytical layer consistent, Uber’s teams were forced into a vicious cycle: they rewrote full datasets every night, often reprocessing 100 terabytes of data just to capture 100 gigabytes of actual change.
It was computationally absurd. But worse, it was intellectually regressive. Teams couldn’t experiment. They couldn’t build models that updated in minutes. They couldn’t close the loop between insight and impact.
More Spark executors wouldn’t solve this. Faster queries on broken architecture would only paper over the cracks. The issue wasn’t downstream, it was foundational.
Uber didn’t just need a better data pipeline.
They needed a new kind of storage abstraction.
They needed a way to capture change, track version history, and query incrementally. They needed to write like a database and read like a warehouse.
They needed something that could coexist with the messiness of streaming systems but offer the orderliness of batch.
So… they built it.
And they called it Hudi, which stands for Hadoop Upserts, Deletes, and Incrementals.
Hudi was never just a table format. It was a radical rethinking of what it meant to manage data at scale.
It didn’t just patch over Hadoop’s limitations; it treated mutability, time, and evolution as first-class concerns. It embedded lineage into the filesystem.
It also carried a timeline in its core. It spoke the language of streaming without losing the rigor of ACID.
In doing so, Hudi didn’t just solve Uber’s ingestion problem.
It made daily rewrites obsolete.
It made incremental processing native.
It made real-time analytics on a data lake possible.
And it opened a new chapter in the evolution of modern data platforms, one where data is no longer a static asset to be warehoused and queried, but a living, breathing stream to be continuously refined, versioned, and served.
The Second Generation Data Platform
Uber’s first-generation data platform was built on a deceptively simple architecture.
Operational data like rides, payments, and events flowed into Vertica. From there, analysts could query it with SQL, build dashboards, model performance, and run ad-hoc investigations. It was fast, clean, and expressive.
But as Uber expanded geographically, operationally, and technically, cracks began to show.
Vertica was very expensive. Storage costs ballooned with data growth. More importantly, it wasn’t built for the kind of write-heavy, high-velocity ingestion Uber was facing.
Every new team, service, and use case meant new data streams, and they grew not linearly but exponentially. The volume wasn’t just increasing: the data itself was mutating.
The answer seemed obvious: migrate to Hadoop. Adopt HDFS for cheap, distributed storage. Embrace Parquet for compression and scan efficiency. Use Hive and Spark for query and transformation.
And for a while, yes… It worked.
Parquet compressed Uber’s event data by orders of magnitude. Spark gave engineers the power to process petabytes with elegant code. Data democratization accelerated sharply.
But beneath the surface, a deeper problem began to emerge: a mismatch between what Uber needed from its data and what the new system could provide.
The Immutable Wall
Parquet, as we all probably know, is a columnar format. It shines in read performance. But its design assumes immutability. You don’t update a single row in a Parquet file. You rewrite the file entirely.
This is fine for static datasets. It’s catastrophic for dynamic ones.
At Uber, data was never still.
Rides were updated with tips, cancellations, reassignments. Driver locations changed every few seconds. Payments were retried, split, or failed. Each correction, each status update, each new event meant touching old records.
But in a world of immutable files, that meant rewriting massive swaths of data.
Every day, Uber was updating around 100GB of data across tables. But because of how Parquet works, they had to rewrite 100TB worth of files. Daily.
Let that sink in: 100GB of useful changes → 100TB of redundant rewrites.
And this wasn’t just expensive. It was very risky.
Massive rewrites created race conditions, broke lineage, introduced inconsistencies, and put pressure on the underlying infrastructure. Recovery from failures became a nightmare.
You could lose an hour’s worth of computation to a single file-level conflict.
Worse still, the data itself was messy. Late-arriving events from mobile apps in poor network conditions. Out-of-order events from upstream systems. Corrupted or malformed rows that needed to be fixed after ingestion.
The new platform didn’t tolerate any of that well. It lacked memory. It lacked nuance.
Uber needed something more fundamental than storage or schema evolution. It needed a system that could track, mutate, and stream data: all while preserving the economics of a data lake.
They didn’t just need files. They needed increments.
Hudi: Hadoop Upserts, Deletes, and Incrementals
At first glance, Hudi seems like just another layer in the ever-growing Hadoop stack.
But spend time with it, read its commit logs, follow its timeline, trace how it handles updates, and you realize it’s doing something more radical.
It’s changing the rules of the data lake.
Not by tearing down what came before, but by wrapping around it—augmenting immutable systems like HDFS and Parquet with mechanisms they were never designed for: mutation, versioning, memory.
You can think of Hudi not as a file format or a storage engine, but as a control plane for data evolution. A meta-layer that brings logic and coordination to a world of frozen blocks.
Its design revolves around three deceptively simple goals:
Upserts and Deletes, Built In
Hadoop’s greatest strength, immutability, is also its greatest weakness.
Once data is written to Parquet, it can’t be changed without rewriting entire files. This is fine for append-only logs, but real-world systems aren’t that clean.
As we said earlier, data was constantly changing: GPS signals arrived late, fraud models flagged old rides, Kafka replays created duplicates. Fixing any of this meant brute-force rewrites of massive datasets.
Hudi’s first promise was: you can update your data.
It achieves this by shifting the unit of mutation. Instead of treating a Parquet file as atomic, Hudi introduces a new abstraction: the file slice, a combination of a compacted base file and a chain of append-only logs.
Writes go to logs. Periodically, logs are merged back into base files through compaction. The result? You can upsert individual records, without rewriting everything.
Deletes, too, are just special types of updates, marked and processed through the same log structure. No need for fragile workarounds or full refreshes.
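To make that concrete, here is a minimal PySpark sketch of an upsert against a Hudi table, assuming Spark is launched with the Hudi bundle on the classpath. The table name, path, and columns (trips, /data/lake/trips, trip_id, and so on) are illustrative, not Uber’s actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Pretend these are corrected trip records arriving after the original rides were written.
updates = spark.createDataFrame(
    [("trip-001", "2015-08-01", 23.50, "2015-08-01T10:05:00Z")],
    ["trip_id", "trip_date", "fare", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",        # identifies the record to upsert
    "hoodie.datasource.write.partitionpath.field": "trip_date",  # partition layout
    "hoodie.datasource.write.precombine.field": "updated_at",    # latest version wins on conflict
    "hoodie.datasource.write.operation": "upsert",               # or "delete" to remove records
}

# Records with an existing key are updated; new keys are inserted.
# No full-table rewrite is required.
updates.write.format("hudi").options(**hudi_options).mode("append").save("/data/lake/trips")
```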
Incremental ETL, Natively Supported
Most ETL jobs today are stuck in a wasteful loop: scan all the input, process everything again, overwrite the output; even when only a tiny fraction of records changed.
This made sense when you couldn’t track changes. But Hudi’s second promise is: you can know what changed.
Under the .hoodie/ directory of every Hudi table lies a timeline: a structured ledger of all writes, compactions, cleanups, and rollbacks, each with a unique ID. It records exactly what happened, when, and to which file groups.
This unlocks powerful capabilities:
Incremental queries, where pipelines read only the new or updated records since the last checkpoint.
Time travel, to replay the state of the table at any historical moment.
Auditability, with a full changelog baked into the table itself.
ETL jobs no longer have to blindly reprocess everything. They can subscribe to data evolution, just like logs or CDC streams, but backed by a durable, queryable file system.
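Here is a hedged sketch of what that looks like from Spark: an incremental read that pulls only the records committed after a known instant. The checkpoint value and path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

# The commit instant the previous ETL run stopped at (illustrative value).
last_checkpoint = "20150801000000"

incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint)
    .load("/data/lake/trips")
)

# Downstream ETL now touches only the changed rows instead of rescanning the table.
incremental.createOrReplaceTempView("trips_changes")
spark.sql("SELECT trip_id, fare FROM trips_changes").show()
```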
A Unified Table Model for Batch and Stream
Traditionally, you had to choose: batch or stream. Your tables were either ingesting micro-batches or appending log records.
The APIs were totally different, the semantics mismatched, and bridging the two was awkward.
Hudi says: why not both?
Its architecture supports streaming ingestion: via Kafka, Spark, Flink, or other systems, but writes them in a way that’s safe, idempotent, and consistent with batch semantics.
Streaming upserts are written as delta_commits, just like batch writes. The timeline makes no distinction at the table level.
Reads don’t care whether a record came from a micro-batch or a streaming job: they just query the latest file slice.
In other words, Hudi gives you a single, coherent abstraction over two very different ingestion models.
This is especially powerful for companies like Uber, where the line between real-time and offline data is blurred. Feature stores, ML pipelines, user metrics: all need fresh data, but with batch-grade correctness. Hudi delivers both.
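As a rough illustration, the same table can be fed from a stream. The sketch below assumes a Kafka topic of trip events and Spark Structured Streaming; the topic, brokers, schema, and paths are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("hudi-streaming-sketch").getOrCreate()

# Raw trip events from Kafka (hypothetical topic and brokers).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "trip-events")
    .load()
)

schema = StructType([
    StructField("trip_id", StringType()),
    StructField("trip_date", StringType()),
    StructField("fare", DoubleType()),
    StructField("updated_at", StringType()),
])

# Parse the Kafka value into columns matching the table schema.
parsed = events.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")

# Streaming upserts land on the timeline as delta_commits, just like batch writes.
query = (
    parsed.writeStream.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.partitionpath.field", "trip_date")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("checkpointLocation", "/data/checkpoints/trips")
    .outputMode("append")
    .start("/data/lake/trips")
)
```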
A Rethink of the Write Path
The data lake was never meant to be fast.
It was born from a simpler time: built to archive, not to act. You wrote once, read many, and prayed you never had to update anything.
But the modern data stack doesn’t play by those rules anymore. Streaming pipelines, real-time dashboards, machine learning feature stores: these demand a lake that can move.
One that doesn’t just store state but understands change. One that doesn’t collapse under the weight of constant mutation.
That’s where Apache Hudi comes in.
It doesn’t just bolt on new capabilities. It rethinks the fundamentals: how data is written, tracked, versioned, and reconciled. How it remembers. How it evolves.
Let’s unpack how.
Writes, Reimagined
In most formats, a write is a final act. A new file lands in a directory, immutable and dumb. If you want to update a record, you rewrite the whole thing or layer on external complexity to keep track of what changed.
Hudi flips that logic.
Writes aren’t final: they’re logged. Every insert, update, and delete is appended to a log file, preserving the chronological order of changes.
These logs form a kind of write-ahead memory, decoupling ingestion from reconciliation.
This gives you options:
Need fast reads? Scan compacted base files.
Need freshness? Merge base files with logs on the fly.
Writes become a streaming-friendly, append-only operation. Compaction, the merging of logs into efficient columnar storage, becomes a background process, completely decoupled from the write path.
And behind it all is an index that maps each record to its physical location, making upserts viable at scale.
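A small sketch of how that choice surfaces in practice: the table type set at write time decides whether an update lands in an append-only log file (Merge-on-Read) or forces a base-file rewrite (Copy-on-Write). Names, paths, and values here are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor-sketch").getOrCreate()

updates = spark.createDataFrame(
    [("trip-001", "2015-08-01", 25.00, "2015-08-01T11:00:00Z")],
    ["trip_id", "trip_date", "fare", "updated_at"],
)

# MERGE_ON_READ routes this update into an append-only log file for the record's
# file group; COPY_ON_WRITE would synchronously rewrite the base Parquet file instead.
mor_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "trip_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

updates.write.format("hudi").options(**mor_options).mode("append").save("/data/lake/trips")
```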
Timelines: The Memory of the Lake
Hudi tables don’t just store data; they remember how it got there.
At the heart of every table is the timeline, a journal stored under .hoodie/ that records every change, every failure, every compaction. It’s the system of record for the system of record.
Each timeline action gets a unique instant ID, shaped like this:
<timestamp>.<action>[.<state>]
Common Actions:
commit: a batch insert or upsert
delta_commit: a streaming write
compaction: merging logs into base files
clean: removing obsolete files
rollback: undoing failed writes
savepoint: pinning a consistent snapshot
Each action flows through three states:
REQUESTED: scheduled but not started
INFLIGHT: currently running
COMPLETED: successfully finished
This timeline enables:
Time travel: Query a table as of a specific instant
Incremental pulls: Only read new data since the last checkpoint
Concurrent write resolution: Detect and resolve conflicts
Smart cleanup: Only delete files no longer referenced by any timeline
Your data lake now has a memory; and with memory comes great power.
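For example, a time-travel read is just an ordinary Spark read pinned to an instant on that timeline. The instant value and path below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-time-travel-sketch").getOrCreate()

# Query the table exactly as it looked at a completed instant on the timeline.
as_of = (
    spark.read.format("hudi")
    .option("as.of.instant", "20150801120000")  # any completed commit time
    .load("/data/lake/trips")
)

as_of.show()
```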
The Dual-File Architecture
A Hudi table isn’t just a pile of Parquet files.
It’s a structured layout of file groups, each identified by a unique fileId. Each file group is a self-contained history of one slice of the table.
Inside each group, you find file slices:
A Base File: Columnar, compacted, efficient. Your typical Parquet.
One or more Log Files: Append-only, row-oriented, storing incremental updates in formats like Avro.
This layout gives you the best of both worlds:
Fast writes: Land in logs.
Fast reads: Prefer base files.
Freshness when needed: Merge on the fly.
This design allows Hudi to ingest tens of thousands of records per second, without grinding the system to a halt.
Log files act as high-speed buffers. Base files get rebuilt only when needed.
The Art of Compaction
In a Merge-on-Read (MoR) table, compaction is the moment logs and base files reconcile.
You can think of it as data reconciliation, turning a fragmented history into a consistent snapshot.
Hudi gives you options:
1. Async Compaction (Default)
Compactions run in the background, completely decoupled from ingestion. Ideal for systems that prioritize write throughput over read latency.
Pros:
High write velocity
No ingestion slowdowns
Cons:
Slightly stale reads unless logs are merged at read time
2. Inline Compaction
Trigger compaction right after a write operation. Keeps data fresh and read-optimized but slows ingestion.
Best for:
Use cases with low-latency analytics needs
Simpler orchestration pipelines
3. Scheduled Compaction
Let an orchestrator like Airflow or Flink decide when to compact, based on logic like:
Number of unmerged log files
Age of logs
Read SLA thresholds
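Roughly, these choices surface as write configs. The sketch below shows the inline variant; the option names are standard Hudi configs, but defaults and exact behavior vary by version, so treat it as illustrative.

```python
# Steering compaction of a MERGE_ON_READ table through write-time configuration.
compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Inline compaction: merge logs into base files right after the write commits.
    # Freshest read-optimized views, at the cost of slower ingestion.
    "hoodie.compact.inline": "true",
    # Only trigger once this many delta commits have accumulated.
    "hoodie.compact.inline.max.delta.commits": "5",
}

# Leaving "hoodie.compact.inline" at its default of "false" keeps compaction asynchronous:
# it can then be scheduled and executed by a separate job or an orchestrator such as Airflow.
```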
Merge Strategies
You don’t just merge: you decide how to merge:
Precombine logic: Which version of a record wins?
Payload classes: Custom logic to handle conflicting updates
Custom merge operators (since v0.14): Build your own semantics
This isn’t just maintenance. It’s intelligent, policy-driven optimization.
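A hedged sketch of what that policy looks like in configuration, using payload classes that ship with Hudi (both from the org.apache.hudi.common.model package); the precombine field is illustrative.

```python
merge_options = {
    # Which field decides the winner when two versions of the same key collide.
    "hoodie.datasource.write.precombine.field": "updated_at",
    # DefaultHoodieRecordPayload applies that ordering when merging against stored records;
    # OverwriteWithLatestAvroPayload (Hudi's long-standing default) simply keeps the latest
    # incoming record.
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
}
```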
Knowing Where Records Live
If you're doing upserts at scale, you must answer one question efficiently:
Where does this record live?
Hudi’s index is how it answers that: live, at write time.
Index Types:
Bloom Index (default): Uses Bloom filters in base files. Lightweight, fast, but may need to scan multiple files on false positives.
Simple Index: Brute-force, scans all files. Accurate but expensive. Great for small datasets.
Global Bloom Index: Extends Bloom Index across partitions. Allows deduplication but increases write cost.
HBase Index: External key-value store for fast lookups. Scales well but adds operational overhead.
Bucketized Index (experimental): Designed for better load balancing and large-scale ingestion.
Once a record is mapped to a file group, it stays there, even across compactions. This ensures consistency and avoids costly reshuffles.
The index isn’t just an optimization: it’s the enabler of everything else: fast upserts, streaming ingestion, incremental queries.
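In configuration terms, the choice comes down to a single write option; the sketch below shows the Bloom default, with alternatives noted in a comment. Exact tuning knobs differ per index type and Hudi version.

```python
index_options = {
    # "BLOOM" is the documented default; alternatives include "GLOBAL_BLOOM",
    # "SIMPLE", "HBASE", and the newer bucket-style index ("BUCKET").
    "hoodie.index.type": "BLOOM",
}

# Bloom-based indexes keep their filters in base-file footers and consult them at write
# time to narrow down which file groups may already contain an incoming record key.
# Merge these with the usual write options and pass them to df.write.format("hudi").
```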
What It All Means
Uber didn’t build Hudi to chase a trend.
They built it out of necessity: because at their scale, with data arriving by the second and decisions riding on every event, the old ways of managing data just couldn’t keep up.
The batch-oriented, immutable lake was too slow, too rigid, too expensive to reprocess in full every time something changed.
Hudi is their response to that pressure. It’s not just a tool: it’s a shift in thinking. Data isn’t static. It evolves. It accumulates history. It needs to be mutable at high throughput without sacrificing consistency.
And so Hudi reimagines the entire write path: logs instead of files, timelines instead of snapshots, merges instead of overwrites.
It’s not perfect. Iceberg may be cleaner. Delta Lake may be more widely adopted. But Hudi is raw and pragmatic, forged in the fire of real-time demands.
It reflects the mindset of engineers who couldn’t afford to wait for batch windows, who had to reconcile change as it happened and still deliver fresh, reliable insight.
In that sense, Hudi is more than a storage format. It’s a philosophy: one that treats the lake not as an archive, but as a living system.