How Facebook Built Presto: The SQL Engine That Ate the Data Lake
Explore how Presto reinvented SQL analytics with in-memory pipelines, bytecode magic, and data-aware optimizations.
This post is the kind of deep dive I wish I’d had two years ago.
We’ll look carefully at why and when Presto was built, what made its architecture radically different from MapReduce-based engines, and how it quietly reshaped the way modern data systems are designed, plus a few extra insights you’ll pick up along the way.
We’ll explore:
Presto’s pipelined, in-memory query execution model,
the modular connector architecture,
its federated SQL capabilities,
the shift from brittle batch jobs to fast, multi-store analytics,
and the lessons Presto taught us about building real-time, distributed data systems at scale.
No more wasting precious time: let’s dive in.
Introduction
Back in 2012, Facebook was literally drowning in data.
And not in the poetic, “data is the new oil” or LinkedIn-influencer sense. I mean actually drowning—in petabytes of logs, user interactions, clickstreams, photos, and service events pouring into HDFS every second of the day.
At the time, Facebook’s core data infrastructure was built on Hadoop, with Hive layered on top. Hive had been revolutionary just a few years earlier. It brought SQL semantics to MapReduce, making it possible to write analytical queries using familiar syntax. But by 2012, it was showing its age. Hive was fundamentally batch-oriented. Every query, no matter how simple, had to compile into slow, heavyweight MapReduce pipelines.
Want to check yesterday’s metrics? That’s a 40-minute job. Want to power a real-time dashboard? Forget about it.
Internally, the demand for fast, interactive analytics was exploding. Engineers wanted sub-second answers. Product managers wanted dashboards that updated as fast as users clicked. Data scientists wanted to explore data stored across Hive, MySQL, and Cassandra—without learning three different systems or building glue code to bridge them. The data was there. The tools simply weren’t.
So a small team at Facebook did what seemed borderline insane at the time: they decided to build a new distributed SQL engine—one that wouldn’t rely on MapReduce at all, could federate across multiple backends, and could process terabytes in seconds, not hours.
That system soon became known as Presto.
Presto wasn’t just built to replace Hive—it was built to escape it. Its architecture flipped everything on its head. Instead of compiling SQL into batch jobs, Presto executed queries in memory using a pipelined, MPP (massively parallel processing) model. Queries were parsed, optimized, and split into stages, then executed across a pool of long-running JVM workers. No startup overhead. No serialization bottlenecks. Just raw, vectorized compute streaming results back as fast as the cluster could push them.
This wasn’t just fast as hell—it was flexible. Presto introduced a connector architecture that let it query virtually any data source: HDFS, S3, MySQL, Cassandra, Kafka—even proprietary internal stores. You could join a CSV in S3 with a table in MySQL and a log stream in Kafka—all in one query, all in SQL. No more exporting. No more format gymnastics. Just correct answers.
Rolled out across Facebook in 2013 and open-sourced that November, Presto quickly became Facebook’s go-to engine for interactive analytics. It powered A/B testing dashboards, product metrics, internal developer tools, and anything that demanded fast access to fresh data. Its multi-tenant model allowed different teams to share clusters safely.
Its efficient memory-conscious design let it operate efficiently at petabyte scale. And its connector ecosystem made it one of the most extensible SQL engines ever built.
The industry caught on fast. Netflix, Uber, and Airbnb adopted Presto to power their own data platforms. Amazon took it further, launching Athena, a serverless interactive query service built on Presto under the hood.
A Detailed Overview
At Facebook, Presto quickly evolved into a versatile yet extremely powerful engine, capable of supporting a wide array of use cases, each with distinct requirements and challenges:
Interactive Analytics
Facebook operates a massive multi-tenant data warehouse as an internal service, where several business functions and organizational units share a smaller set of managed clusters.
Data is stored in a distributed filesystem and metadata is stored in a separate service. These systems have APIs similar to those of HDFS and the Hive metastore service, respectively. In this post, we refer to this as the ‘Facebook data warehouse’, and use a variant of the Presto ‘Hive’ connector to read from and write to it.
Facebook engineers and data scientists routinely examine small amounts of data (50GB–3TB compressed), test hypotheses, and build visualizations or dashboards. Users often rely on query authoring tools, BI tools, or Jupyter notebooks.
Individual clusters are required to support at least 50–100 concurrent running queries with diverse query shapes, and return results within seconds or minutes.
Users are highly sensitive to end-to-end wall clock time and may not have a good intuition of query resource requirements. While performing exploratory analysis, users may not require that the entire result set be returned.
Queries are often canceled after initial results are returned, or use LIMIT clauses to restrict the amount of result data the system should produce.
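To make that concrete, here is the shape of a typical exploratory query in this workload. The table and column names are invented for illustration; the point is the tight date filter and the LIMIT that caps how much the cluster has to produce:
SELECT userid, event_type, event_time
FROM events
WHERE ds = '2013-08-01'
  AND event_type = 'click'
ORDER BY event_time DESC
LIMIT 100;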
Batch ETL
The data warehouse we described above is populated with fresh data at regular intervals using ETL queries.
Queries are scheduled by a workflow management system that determines dependencies between tasks and schedules them accordingly. Presto supports users migrating from legacy batch processing systems, and ETL queries now make up a large fraction of the Presto workload at Facebook by CPU.
These queries are typically written and optimized by data engineers. They tend to be much more resource-intensive than queries in the Interactive Analytics use case, often involving CPU-heavy transformations and memory-intensive aggregations or joins with other large tables (we’re talking about many TBs of distributed memory).
Query latency is somewhat less important than resource efficiency and overall cluster throughput.
A/B Testing
A/B testing is used at Facebook to evaluate the impact of product changes through statistical hypothesis testing.
A huge part of the A/B test infrastructure at Facebook is built on Presto. Users expect test results to be available in hours (rather than days) and that the data be complete and accurate. It is also important for users to be able to perform arbitrary slice and dice on their results at interactive latency (5–30s) to gain deeper insights.
It is difficult to satisfy this requirement by pre-aggregating data, so results must be computed on the fly. Producing results requires joining multiple large data sets, which include user, device, test, and event attributes.
Query shapes are restricted to a small set since queries are programmatically generated.
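A generated A/B test query might look roughly like the sketch below. The schema is hypothetical, but the pattern is representative: join event data against the test assignment and user-attribute tables, then slice the metric by whatever dimensions the analyst picked:
SELECT a.test_group,
       u.country,
       u.device_type,
       COUNT(DISTINCT e.userid) AS users,
       SUM(e.conversions) AS conversions
FROM events e
JOIN test_assignments a ON e.userid = a.userid
JOIN users u ON e.userid = u.userid
WHERE a.test_id = 1234
  AND e.ds BETWEEN '2013-08-01' AND '2013-08-07'
GROUP BY a.test_group, u.country, u.device_type;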
Developer/Advertiser Analytics
Several custom reporting tools for external developers and advertisers are built on Presto.
One example deployment of this use case is Facebook Analytics, which offers advanced analytics tools to developers who build applications on the Facebook platform.
These deployments typically expose a web interface that can generate a restricted set of query shapes. Data volumes are large in aggregate, but queries are highly selective, as users can only access data for their own applications or ads.
Most query shapes contain joins, aggregations, or window functions. Data ingestion latency is in the order of minutes. There are very strict query latency requirements (50ms–5s) as the tooling is meant to be interactive. Clusters must have at least 99.999% availability and support hundreds of concurrent queries given the volume of users.
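As a sketch of what such a tool might generate (the table and column names here are made up), note how selective the filter is and how the work is an aggregation plus a window function:
SELECT ds,
       SUM(spend) AS daily_spend,
       SUM(SUM(spend)) OVER (ORDER BY ds ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_spend
FROM ad_events
WHERE advertiser_id = 42
GROUP BY ds
ORDER BY ds;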
Inside the Engine: How Presto Works
At a high level, a Presto cluster consists of two main components:
A single coordinator
A huge fleet of worker nodes.
The coordinator acts as the brain—it handles incoming queries, parses and analyzes the SQL, builds a distributed execution plan, and orchestrates execution. The workers? As you may expect, they do the heavy lifting.
Here’s what happens when you send a query to Presto.
The client (usually a CLI, a notebook, or some dashboard tooling) sends an HTTP request with a SQL statement to the coordinator.
The coordinator first checks queue policies to make sure the cluster isn’t overloaded. Then it parses and analyzes the SQL, builds a logical plan, optimizes it, and breaks it into a series of stages and tasks. Tasks in the leaf stages are assigned ‘splits’: units of data access from an external system.
These splits are distributed across the worker nodes, which begin processing them in parallel. The workers fetch data directly from external systems—like, for example, HDFS, S3, MySQL, or Kafka—using a connector interface, and they execute each task in a pipelined, in-memory fashion.
Data flows between tasks as soon as it’s available, and Presto is smart enough to return early results for certain query shapes before all the data has been processed.
Under the hood, workers use cooperative multitasking to execute multiple queries concurrently. Intermediate data and execution state are kept in memory whenever possible, and data shuffles are buffered and tuned for low latency. It’s a pure MPP engine: no spinning up new JVMs, no MapReduce-style serialization—just distributed, low-latency, highly efficient SQL.
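You can see this stage breakdown for yourself with EXPLAIN. The distributed plan for a simple aggregation, for instance, shows the fragments (stages) and the exchanges (shuffles) between them; the exact output varies by Presto version and connector:
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderkey, SUM(tax)
FROM lineitem
GROUP BY orderkey;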
Presto is also built to be extensible. Its architecture is plugin-based, meaning nearly every subsystem—data types, functions, access control, event hooks, queue policies—can be customized or replaced. But the real power lies in its connector system.
A connector in Presto is built using four key APIs: Metadata, Data Location, Data Source, and Data Sink. This gives Presto the flexibility to query any data source that can expose these APIs—even if it’s a proprietary database, an in-house blob store, or something totally custom.
Over the years, the Presto ecosystem has exploded with dozens of community-built and commercial connectors. At Facebook, this architecture scaled to petabyte workloads across Hive, MySQL, Raptor, and sharded internal systems. In the broader world, it let Presto unify querying across S3, Kafka, Cassandra, and anything else you can throw SQL at.
How Presto Turns SQL into Speed
So you’ve written a SQL query, hit "Enter," and Presto returned your results in seconds. But what actually just happened? Did the query planner cast a spell? Did a thousand threads wake up in the cloud and shout in unison?
Well, sort of.
This section is the backstage pass: we’ll follow your query’s lifecycle—from SQL parsing to things like distributed execution—highlighting the clever tricks, aggressive optimizations, and design decisions that make Presto so damn fast.
Along the way, we’ll stop to admire some esoteric features, raise some questions (“Why does ANSI SQL hate arrays?”), and maybe make fun of shuffles, just a little, I promise.
The SQL Dialect: ANSI… With Benefits
Presto starts from a solid foundation: ANSI SQL. That means no awkward surprises—if you’ve written SQL before, you’re already halfway there.
But let’s be honest: ANSI SQL wasn’t natively built for JSON blobs, maps, and deeply nested arrays. Want to filter an array of structs based on an embedded timestamp field? Good luck doing that in plain SQL, without losing your sanity.
This is where Presto’s thoughtful extensions come in. The engine introduces higher-order functions like transform, reduce, and filter, along with support for inline lambdas. This turns SQL into a more expressive language—almost like functional programming meets relational algebra. It’s SQL that actually wants you to manipulate arrays and maps, not punish you for trying.
Think of it this way: in Presto, you can filter(x -> x > 10) inside a column. In standard SQL, you’d be writing a UDF and sacrificing a goat.
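Here is a quick taste of what that looks like in practice. The table and column names below are made up, but transform, filter, and reduce are real Presto functions that accept inline lambdas:
-- keep only readings above a threshold, then square them
SELECT transform(filter(readings, x -> x > 10), x -> x * x) AS filtered_squared
FROM sensor_data;

-- sum an array in place, without UNNESTing it into rows
SELECT reduce(readings, 0, (s, x) -> s + x, s -> s) AS total
FROM sensor_data;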
The Early Stages of a Query
Let’s follow your query closely through the engine.
1. Client Interfaces
First, your query hits the Presto coordinator. Presto speaks fluent HTTP—it exposes a clean RESTful API. That’s great for integration and observability. But if you’re more old-school (or like to feel powerful in a terminal), there’s also a CLI. Or maybe you’re a data analyst living inside Tableau—Presto’s JDBC driver has you covered too.
Basically, if you can write SQL, you can talk to Presto. And Presto listens carefully.
2. Parsing
Once received, Presto hands your query to an ANTLR-powered parser. This transforms your SQL into an Abstract Syntax Tree (AST)—a structured, machine-readable representation of what you're trying to say. Next, a semantic analyzer checks it for correctness:
Are those columns real?
Are you accidentally summing a VARCHAR?
Is that function overload unambiguous?
This step is like the TSA of query execution—your bags are getting checked.
3. Logical Planning
Once the query passes inspection, Presto builds a logical plan. This is a tree of abstract operators like Scan, Join, Aggregate, and Filter. It describes what needs to happen, not how to do it. Think of it as the declarative blueprint of your query.
Let’s visualize this query:
SELECT
orders.orderkey,
SUM(tax)
FROM orders
LEFT JOIN lineitem
ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey;
Presto translates that into this logical structure:
        Aggregate [SUM(tax)]
                |
        LeftJoin [ON orderkey]
          /              \
   Filter                 Scan [orders]
   [discount = 0]
      |
   Scan [lineitem]
Notice how modular it is. Each node can be optimized, split into a stage, or even pushed into a connector.
Query Optimization: Where the Magic Happens
Once the logical plan is ready, Presto invokes its optimizer. Now it gets interesting.
Presto’s optimizer is rule-based and greedy: it applies transformations until the plan stops changing. These rules aren’t random; each has a pattern, a trigger condition, and a rewrite strategy. For instance, if a filter can be pushed below a join, the rule does it.
Let’s walk through a few highlights.
1. Cost-Based Join Decisions
While many rules are syntactic, Presto has two key areas where it uses actual data statistics: join order selection and join strategy.
Let’s say you're joining three tables. There are six possible join orders. Add one more table? Now there are 24. The optimizer evaluates various join permutations using cardinality estimates and picks the best one.
And yes, if those estimates are off (because the stats are stale), the optimizer might go rogue and pick a Cartesian product. We've all been there.
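If you want to inspect or influence these decisions yourself, Presto exposes a few knobs. The statements below exist in recent Presto releases, though exact property names and defaults vary by version, so treat this as a sketch rather than gospel:
-- inspect the table statistics the optimizer relies on
SHOW STATS FOR orders;

-- let the cost-based optimizer pick join order and join distribution
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION join_distribution_type = 'AUTOMATIC';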
2. Data Layouts: Multiple Views of the Same Table
Connectors in Presto can expose multiple physical layouts of the same table—each with different sorting, indexing, or partitioning. This gives the optimizer options.
You want to filter on a timestamp? It can choose a layout sorted by date. Joining by user ID? Pick the layout with the hash index. It’s like having a tailored version of the same table for every query shape. Fancy, right?
This is heavily used at Facebook for advertiser analytics, where query patterns are known and indexed views can be precomputed and plugged in.
3. Predicate Pushdown: Because Reading Less Data is Good
Presto connectors often support predicate pushdown. That means filters like discount = 0 can be passed down into the storage engine—whether it’s Hive, Iceberg, or a MySQL shard. Why scan the whole table when you only need a slice?
Even better, if the connector supports indexes, Presto will choose the layout that lets the filter hit the index. In MySQL, this can mean hitting just one shard instead of 20. In Hive, it might prune 90% of the partitions.
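Here is a minimal sketch from the query’s point of view. Assuming lineitem is a Hive table partitioned on a hypothetical ds column, the ds predicate prunes partitions before a single file is opened, and discount = 0 can be checked against ORC or Parquet statistics to skip entire stripes:
SELECT COUNT(*)
FROM lineitem
WHERE ds = '2013-08-01'
  AND discount = 0;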
As a bonus: the less data read, the fewer rows shuffled. And speaking of shuffles, here we go.
4. Parallelism and Shuffles: Presto’s Distributed Brain
Once the plan is optimized, it’s split into stages that run in parallel across the cluster. Each stage is made up of tasks: units of work that run on worker threads, each processing a ‘split’ of the data.
Between stages, Presto inserts shuffles—buffered data transfers that redistribute data (usually for joins or aggregations). These are necessary, but also expensive. Shuffles use memory, burn CPU, and can become bottlenecks.
The goal of the optimizer is to keep shuffles minimal and aligned with the execution plan. You don’t want to introduce a shuffle just because two joins were written in the wrong order.
Here’s a simplified visualization:
Stage 3: Final Aggregation
↓
[Buffered Shuffle]
↓
Stage 2: Partial Aggregation
↓
[Buffered Shuffle]
↓
Stage 1: Scan + Filter + Join (fully parallelized)
The engine also has to reason about data skew—what happens if one worker gets 90% of the rows? (Hint: it explodes. Then you debug for two hours and slowly go crazy.)
Smarter Execution with Data Layouts
When connectors expose how data is physically laid out — for example, whether it's partitioned, sorted, or indexed — the Presto optimizer can take advantage of that structure to avoid expensive operations like shuffles.
This is especially useful in A/B testing, where almost every query needs a heavy join to pull in experiment or user population info. If both tables in the join are partitioned on the same column, Presto can use a co-located join, skipping the usual shuffle entirely. That’s a huge performance win.
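A rough sketch of that setup, using the Presto Hive connector’s table properties (names and details depend on your catalog configuration): if both tables are bucketed on the join key with the same bucket count, the join can execute co-located, with no shuffle of either side.
-- both hypothetical tables are bucketed on userid with the same bucket count
CREATE TABLE events (
  userid  BIGINT,
  ds      VARCHAR,
  payload VARCHAR
)
WITH (bucketed_by = ARRAY['userid'], bucket_count = 64);

CREATE TABLE test_population (
  userid     BIGINT,
  test_group VARCHAR
)
WITH (bucketed_by = ARRAY['userid'], bucket_count = 64);

-- with matching bucketing on the join key, this join can run co-located
SELECT p.test_group, COUNT(*) AS events
FROM events e
JOIN test_population p ON e.userid = p.userid
GROUP BY p.test_group;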
Similarly, if a connector exposes join keys as indexed, the engine can switch to an index nested loop join, which works great for normalized data — like when you're joining dimension tables in a data warehouse with production key-value stores. This pattern shows up a lot in interactive analytics, where latency truly matters.
Reducing Shuffles by Exploiting Plan Properties
Execution nodes in a query plan can also declare properties about the data they produce — such as how it’s partitioned or sorted. These “node properties” guide the optimizer: it can drop unnecessary shuffles or tweak shuffle behavior to satisfy as many constraints as possible in one go.
For instance, rather than partitioning on every grouping column, Presto might settle for fewer to reduce overhead. This can cause some skew, but often it’s a smart trade-off — it may even collapse what used to be four separate stages into one, dramatically cutting down execution time.
Going Wide: Intra-Node Parallelism
Presto also looks for opportunities to parallelize within a single node. Since threads on the same machine can share memory and skip network latency, this is a cheap and efficient way to speed things up.
This helps in two common situations:
Interactive queries that group by a high-skew column (e.g., user country) and filter to a few values.
Batch ETL jobs that read huge volumes with little filtering, causing upstream stages to bottleneck downstream ones.
In these cases, the engine can parallelize parts of the plan — like building hash tables — across multiple threads. This kind of fine-grained parallelism is especially useful when concurrency becomes the limiting factor.
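This degree of local parallelism is usually tunable as well. In the Presto releases I’m aware of, it is exposed as a session property along these lines (the name may differ across versions, so check your deployment):
-- raise local (intra-node) parallelism for a heavy aggregation
SET SESSION task_concurrency = 32;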
Tasks, Pipelines, and Scheduling
Presto breaks down query plans into stages, and each stage into tasks. A task is a unit of work executed on a single worker, and internally it's made up of pipelines — sequences of operators like filters, joins, or aggregates.
Each task can run one or more pipelines concurrently, and pipelines themselves can be split and parallelized if needed. For example, the hash table build side of a join can run across several threads, while the probe side streams through in parallel.
There are two main ways to schedule stages:
All-at-once: Runs all stages as soon as data is ready. Great for latency-sensitive queries.
Phased: Waits until dependencies are met (e.g., hash build before hash probe). More memory efficient, ideal for heavy batch jobs.
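If you want to nudge this yourself, the scheduling policy is exposed as a session property in recent Presto versions (execution_policy in the releases I’m aware of; verify against your deployment):
-- prefer memory efficiency over latency for a big batch query
SET SESSION execution_policy = 'phased';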
Split Scheduling: Lazy, Adaptive, and Efficient
Each task in a leaf stage (the ones that read actual data) is assigned splits — think: file chunks, table ranges, or key buckets. Presto assigns splits lazily in small batches. This has a bunch of benefits:
Queries can start running before all splits are even known. This is great for quick queries with LIMIT or heavy filters.
Workers keep a small split queue, allowing dynamic balancing across machines.
No need to load all metadata up front — a big deal for Hive, where you could easily be looking at millions of files.
This lazy, demand-driven approach is why Presto works well for both fast exploratory queries and large-scale batch processing.
Local Execution & Fast Shuffles
Once a split lands on a worker thread, it runs through Presto’s driver loop — a tight, cooperative pipeline executor that’s smarter than traditional “pull-based” models. It moves pages of data (columnar batches) between operators as long as there’s work to do, squeezing out every drop of CPU per thread quantum.
As for shuffles — the bane of distributed systems — Presto avoids disk where it can. It uses in-memory, HTTP-based long-polling to ship data between nodes. This keeps latency low, buffers fast, and performance snappy — ideal for workloads that can’t afford to wait around.
Presto’s Power: Codegen, Compression, and JVM Wizardry
When you're building a distributed query engine that needs to scan petabytes of data with sub-second latency, you don’t get to leave performance on the table. Presto’s architecture is deeply aware of this fact.
In this detailed section, I’ll walk you through some of the sophisticated—sometimes counterintuitive—optimizations that make Presto way more than just a SQL engine.
Think of it as the behind-the-scenes choreography between Java, memory, and machine code.
A lot of information to digest, I know, but it’s crucial for a real understanding of Presto. Let’s take a deep breath and… go!
JVM: The Hotspot Dance
Presto is written in Java, which comes with both superpowers and tradeoffs. On one hand, you inherit the robustness and safety of the JVM; on the other, you’re dancing with a Just-In-Time (JIT) compiler you don’t directly control, along with its intrinsic complexity and overhead. But instead of fighting the JVM, Presto embraces its quirks.
Critical performance-sensitive logic like histograms, checksums, and compression routines are hand-crafted to align with what the JVM wants to optimize: inlined methods, unrolled loops, monomorphic call sites. Presto even explores using Graal for specific workloads (like 128-bit math), where Hotspot hits a wall.
Garbage collection? That’s another battlefield. Presto opts for G1GC, but it comes with landmines—particularly around “humongous” objects.
The solution? Flat memory arrays instead of object graphs. That histogram you’re aggregating? Behind the scenes, it’s just flat arrays—no individual objects cluttering the GC’s world.
Code Generation: Writing Bytecode on the Fly
Interpreting expressions row-by-row? Too slow. Instead, Presto dynamically writes its own Java bytecode at query runtime. This lets it skip interpretive overhead and lean hard into the JVM’s JIT pipeline. It's not just fast—it's tailored.
But there’s more. Presto sidesteps the usual pitfalls of generic code by generating query-specific bytecode for operators. It unrolls loops when it knows column types, eliminates branching when it can precompute logic, and turns variable references into constants when possible. Think of it as building a just-for-you bytecode microservice for every task.
Because each piece of generated bytecode becomes its own Java class, the JVM can independently profile and optimize them. The actual result? A feedback loop between data and machine code, where the engine reshapes itself based on what it sees.
Columnar Layouts and File Formats
Presto thrives on columnar data, and it loves smart file formats like ORC and Parquet. These formats offer stats, bloom filters, and index-aware skipping. Instead of slurping everything into memory, Presto surgically reads only what’s needed. Bonus: it reads it in a flat, pointer-free layout, so tight loops stay tight.
When you scan a table, Presto doesn't just get rows—it gets pages, composed of block types like:
LongBlock for integers,
RLEBlock for repeated values,
DictionaryBlock for categorical values like "IN PERSON" or "RETURN".
This structure isn’t accidental. It’s a pipeline tuned for vectorized execution—where branch prediction, caching, and SIMD all come into play.
Lazy Loading: Don’t Decode What You Don’t Need
What if you don't touch most of a column? Presto leans on lazy blocks that only decode data if the filter passes.
In production ETL queries, this trick reduced the number of loaded cells by 22% and total CPU time by 14%. Not bad for a strategy that essentially says: “I’ll wait to care until you do.”
Operating on Compressed Data: Think in Dictionaries
Here's where it gets clever. Presto doesn’t always decompress data right away. Instead, it performs operations directly on compressed blocks—especially useful when working with dictionary-encoded or run-length-encoded values.
For example, if you're joining on a returnflag column that holds only a handful of distinct values, Presto just joins on the indices into the dictionary. If values repeat across blocks, it reuses hash table entries instead of recomputing them from scratch.
Even intermediate query results are sometimes compressed. A join operator might output a dictionary block instead of materialized rows—making joins faster and lighter on memory.
Final Thoughts
Presto isn’t just a SQL engine—it’s a manifesto on how to bend modern systems architecture toward speed, scale, and simplicity without compromise. At the surface, it gives analysts a familiar interface; under the hood, it’s a masterclass in distributed computing.
Everything in Presto is tuned for throughput: from dynamically generated bytecode that lets the JVM optimize on-the-fly, to memory-efficient columnar processing that minimizes waste and maximizes locality. Its operators think in blocks, not rows. Its execution plans adapt to data, not just syntax. And its use of compressed representations, lazy decoding, and dictionary-level joins proves that performance isn't just about clever algorithms—it's about deep empathy for data.
What makes Presto special is that it doesn’t take performance as an afterthought—it treats it as an architectural pillar. This is a system that operates at the intersection of compiler tricks, storage efficiency, and runtime orchestration—without giving up the declarative beauty of SQL.
In the world of data engines, many scale. Few perform. Presto does both—with elegance.



