Databricks Photon Engine: Boosting Query Performance and Optimizing Big Data Processing
A Narrow Deep Dive into How Photon Accelerates SQL Workloads on Databricks
"Sometimes it is the people no one imagines anything of who do the things that no one can imagine." — Alan Turing
I'm optimizing my learning pipeline by dissecting every single mechanism of data engineering and the systems around it. This is where I document and share everything I’ve decoded.
Brief Introduction
I first got hooked on big data engines in the summer of 2023. They were so fascinating to me—massive, intricate systems, specifically designed to process unimaginable amounts of data. But as a self-taught developer, I found their architecture an impenetrable maze, and their complexity felt way beyond my grasp. Fast forward two years, and here I am, not just trying to deeply understand them, but to break them down in a way that makes sense.
I gravitated toward Databricks almost instantly once I realized how powerful and innovative its platform was for handling big data. I want to explore how these incredible systems work, why they were built the way they were, and the engineering choices that shaped them.
As data keeps exploding in scale, businesses need analytics platforms that can keep up. Databricks has been leading the charge in big data processing, and with Photon, its next-generation query engine, it’s pushing performance optimization to a whole different level.
But what exactly is Databricks Photon, and why does it matter for modern data workloads? In this article, we’ll explore:
What Photon Engine is and how it fits into the Databricks ecosystem
How Photon improves query performance with vectorized execution and native code generation
The key architectural innovations that make Photon faster than traditional Spark SQL
Real-world benchmarks comparing Photon vs. non-Photon workloads
Whether you’re a data engineer, data scientist, or enterprise architect, understanding how Photon works can help you maximize the efficiency of your SQL analytics and big data workflows.
Let’s finally dive in and uncover how Databricks’ Photon is redefining performance in the world of cloud data processing. 🚀
Understanding Photon Engine: Its Role and Integration within the Databricks Ecosystem
The core team behind Apache Spark, who revolutionized big data processing in 2009, went on to found Databricks four and a half years later, in 2013. Their vision was to simplify the complexities of building, managing, and scaling data applications for businesses and organizations.
Their primary goal was to optimize the open-source stack they had previously developed, streamlining processes for better performance and ease of use. In doing so, they set out to create a unified platform that could handle a variety of data engineering, machine learning, and analytics tasks—all within a scalable, cloud-based infrastructure built for hyperscale workloads and high performance.
To achieve this vision, Databricks embarked on a series of strategic innovations:
In 2019, Databricks introduced Delta Lake, a revolutionary table format designed to bring data warehousing capabilities to data lakes. Delta Lake solved many of the key challenges associated with traditional data lakes, such as data consistency, scalability, and query performance.
With ACID transactions, scalable metadata handling, and schema enforcement, Delta Lake bridged the gap between the flexibility of data lakes and the reliability of data warehouses, creating a more reliable and performant storage layer for big data workloads.
In 2021, Databricks released a groundbreaking whitepaper on the Lakehouse paradigm. The Lakehouse combined the best features of data lakes and data warehouses into a single architecture. It leverages the scalability and cost-effectiveness of data lakes with the management capabilities and transactional integrity typically found in data warehouses.
This hybrid approach provided organizations with a more efficient way to store, process, and analyze vast amounts of structured and unstructured data while eliminating the need for complex ETL pipelines and redundant storage systems.
Through these impressive innovations, Databricks aimed to tackle some of the inherent challenges in the traditional two-tier data architecture, which typically involves a data lake for scalable storage and a data warehouse for structured analytics. Key issues included:
Stale Data in the Warehouse: Data in traditional data warehouses can often become outdated, as they rely on batch processes to extract, transform, and load (ETL) data from the data lake. This results in delayed insights due to the time it takes to synchronize data from one system to another.
Complexity and High Cost of Consolidation: Integrating data from both systems often requires significant processing overhead, complex ETL workflows, and additional storage layers, leading to increased costs and operational complexity.
Data Duplication and Increased Storage Costs: Organizations typically store the same data in both the lake and warehouse, resulting in redundant storage costs and increased complexity in data management.
To address these challenges, Databricks introduced a managed Lakehouse solution, combining the benefits of both data lakes and warehouses in a single platform. The Delta Lake storage layer was designed to enable ACID transactions, schema enforcement, and time travel on top of the data lake, which previously lacked the ability to ensure data consistency and reliability.
By providing metadata management, data versioning, and data quality checks, Delta Lake eliminates the need to move data between disparate storage systems, reducing both the complexity and cost associated with data duplication.
On the processing side, Databricks leveraged Apache Spark as the query engine, which is natively optimized for both batch and streaming workloads. Spark’s distributed computing capabilities enable massive parallel processing across clusters, allowing users to perform complex analytics, machine learning, and data transformations directly on data stored in Delta Lake without the need to copy it into a separate data warehouse.
By merging these two powerful components—Delta Lake for scalable and reliable data storage and Apache Spark for high-performance query execution—Databricks delivers a single unified platform that streamlines the entire data pipeline, from ingestion and storage to analysis and machine learning. This integration reduces the need for separate data lakes and warehouses, cuts down on storage costs, and accelerates time-to-insight, helping organizations optimize their data architecture for scalability, efficiency, and cost-effectiveness.
As a result, Databricks continues to revolutionize how companies handle their big data workloads, enabling modern data teams to work more efficiently with less data duplication, more reliable storage, and faster analytics. The managed Lakehouse architecture simplifies workflows, improves data consistency, and provides a unified approach to handling both structured and unstructured data at scale.
The Huge Challenges
Databricks understood from the beginning that building a robust data management system is not enough to stay competitive in the fast-evolving market of cloud data solutions. To compete with giants like Snowflake, BigQuery, and Redshift, Databricks needed to deliver not only reliable data management but also cutting-edge performance in querying, processing, and analyzing vast amounts of data.
While these competitors primarily positioned themselves as cloud data warehouses, Databricks took a completely different approach by introducing the Lakehouse paradigm, which combined the flexibility of data lakes with the reliability of data warehouses. However, this presented a major challenge, and that challenge was Apache Spark itself. Although highly versatile, the Spark engine was never originally designed to function as a native query engine for a Lakehouse architecture.
In fact, a proper Lakehouse query engine must be capable of handling a diverse range of data—from highly structured, organized datasets to raw, unstructured data with inconsistent layouts, scattered small files, many columns, and missing or incomplete statistics.
The challenge for Databricks was that traditional data warehouse engines were optimized for structured data with predictable schemas and indexing. In contrast, a Lakehouse system needed to deal with data variety and data quality issues—all while providing performant query processing over massive datasets in a highly distributed environment.
Databricks initially relied on Apache Spark as the query engine for its Lakehouse architecture. While Spark excelled in handling distributed workloads and processing large volumes of data, its architecture was not optimized for the high-performance querying needs of cloud data warehouses. As Spark was not designed with Lakehouse workloads in mind, performance could degrade significantly when it came to certain query types, especially complex SQL queries, ad-hoc analytics, and interactive analytics on semi-structured or unstructured data.
To solve this issue, Databricks faced a critical challenge: how to improve the performance of its query engine while preserving compatibility with the massive user base that had already adopted Spark. Replacing Spark entirely wasn’t a feasible option, as it would disrupt the existing ecosystem for thousands of customers. Instead, Databricks took a software engineering approach that would enhance the performance of Spark without abandoning it.
The solution? A new engine they called the “Photon Engine”.
Core Components and Key Releases of the Photon Engine
Photon is a next-generation query engine developed by Databricks, initially introduced on the Databricks blog in mid-2020 for the Delta Lake architecture, and later released in a stable form from 2021 onwards.
It was specifically designed to address the performance gaps in Spark while remaining tightly integrated into the existing Databricks platform. Rather than discarding Spark, Databricks enhanced Spark’s underlying architecture to create Photon as an optimized, high-performance engine capable of executing queries more efficiently across the diverse data types typical of a Lakehouse.
Here’s how this engineering challenge was solved:
Improving Query Execution with Vectorization: One of the key performance optimizations was the introduction of vectorized execution in the new Photon Engine. By processing data in batches (vectors) rather than rows, Photon dramatically reduced the CPU overhead associated with processing each data point individually. This optimization led to better use of CPU resources, allowing queries to process large datasets faster and more efficiently.
C++ Engine for Low Latency: Photon was developed using C++, a lower-level language, which helped overcome some of the performance bottlenecks inherent in Java-based engines like Spark. By moving away from the Java Virtual Machine (JVM), Photon eliminated JVM overhead and provided faster, more scalable execution, crucial for high-performance querying in real-time data systems.
Optimized Data Formats: Photon was built with Delta Lake's Parquet format in mind, leveraging its columnar storage properties. By focusing on predicate pushdown (the ability to filter data earlier in the query execution pipeline), Photon minimized unnecessary I/O, reducing the volume of data processed in the later stages of query execution (a minimal sketch of this idea follows the list).
Integration with Spark’s Catalyst Optimizer: Databricks didn't start from scratch; instead, they leveraged the Catalyst query optimizer built into Spark. Photon improved this by providing better query planning and cost-based optimizations. This made sure that Photon was able to efficiently handle complex, distributed queries and execute them with fewer resources.
Seamless Backward Compatibility: Another critical engineering consideration was ensuring that the transition to Photon would not disrupt existing workloads. Photon was designed to be a drop-in replacement for the existing Spark engine, allowing customers to take advantage of performance boosts without changing their existing data pipelines. This backward compatibility allowed Databricks to maintain its user base while dramatically improving query performance.
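To make the predicate-pushdown idea from the list above concrete, here is a minimal C++ sketch, assuming min/max statistics stored per column chunk (much as Parquet row groups do). The ColumnChunk struct and count_in_range function are illustrative names only, not Photon or Databricks APIs.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// A unit of columnar storage (think: a Parquet row group for one column),
// carrying min/max statistics alongside the encoded values.
struct ColumnChunk {
    int64_t min_value;
    int64_t max_value;
    std::vector<int64_t> values;
};

// Pushing the predicate "value >= lo AND value <= hi" down to the scan:
// chunks whose statistics cannot possibly match are skipped without ever
// decoding (or even reading) their values.
int64_t count_in_range(const std::vector<ColumnChunk>& chunks,
                       int64_t lo, int64_t hi) {
    int64_t count = 0;
    for (const auto& chunk : chunks) {
        if (chunk.max_value < lo || chunk.min_value > hi) continue;  // pruned via stats
        for (int64_t v : chunk.values)
            if (v >= lo && v <= hi) ++count;                         // actually scanned
    }
    return count;
}

int main() {
    std::vector<ColumnChunk> chunks = {
        {0,   99,  {5, 40, 99}},      // pruned: max < 100
        {100, 199, {120, 150, 180}},  // scanned
        {200, 299, {210, 250}},       // pruned: min > 199
    };
    std::cout << count_in_range(chunks, 100, 199) << "\n";  // 3
}
```

The point is simply that chunks whose statistics cannot satisfy the filter are never decoded, which is where the I/O savings come from.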
The Photon Design and Execution Engine: An In-Depth Analysis
As we’ve discussed earlier, in the pursuit of optimizing data processing for the Lakehouse architecture, Databricks made a strategic decision to move away from the traditional JVM-based Spark engine.
The primary driver behind this decision was the need for more efficient execution, particularly when dealing with large-scale and diverse workloads that the Lakehouse paradigm demands. Spark, initially developed with a focus on batch processing, faced several bottlenecks when applied to modern, highly parallelized analytics workloads. The JVM-based engine's in-memory performance, although flexible, struggled with the intense demands of modern analytics.
Several other issues led Databricks to develop a new native execution engine:
JVM Limitations for Large Workloads: The JVM engine, while versatile, could not handle the growing complexity and scale of workloads. Its in-memory performance struggled with highly concurrent operations, particularly when handling large volumes of data.
Lack of Fine-Grained Control: One of the biggest challenges was the limited control Databricks had over lower-level system optimizations. Spark running on the JVM could not leverage custom SIMD (Single Instruction, Multiple Data) instructions, which are crucial for vectorizing computations and achieving maximum performance.
Garbage Collection Challenges: Spark’s JVM-based engine required manual management of off-heap memory, adding significant complexity to the codebase. As data volumes increased, garbage collection performance worsened, particularly when heap memory exceeded 64GB. This caused frequent performance hiccups, especially with large datasets.
Memory Management Overheads
Beyond the challenges of garbage collection, memory management within the JVM also posed significant difficulties. Off-heap memory gave better control over large data structures, but it required manual intervention and further complicated the codebase. For larger workloads, the JVM's automatic memory management mechanisms were no longer sufficient, and engineers found themselves writing additional low-level memory-management code to ensure that memory was allocated and deallocated efficiently.
JIT Compiler and Performance Tuning
The Just-In-Time (JIT) compiler in the JVM offers dynamic compilation and optimization, but it also limits fine-tuned performance control. For Spark workloads, the generated Java code had to be optimized manually to make the most out of the JIT compiler's capabilities. This often meant optimizing loops to use SIMD instructions, tweaking memory access patterns, and other low-level optimizations that could not be automated.
Constrained Execution Model for Complex Data Types
The JVM-based engine also faced difficulties with handling complex, unstructured, or nested data types that are increasingly common in modern big data workloads. These data types often include large strings, un-normalized data, and nested structures, which require specialized memory layouts and processing mechanisms. The JVM's memory model and execution model were not optimized for such workloads, which led to inefficient memory usage and slower processing times.
Lack of Direct Access to Hardware Features
The JVM-based approach also made it difficult to take advantage of hardware-specific optimizations. Modern processors offer specialized instruction sets, such as SIMD and AVX (Advanced Vector Extensions), which can significantly improve performance for data-intensive operations. However, the JVM cannot directly leverage these hardware features, limiting the ability to fine-tune performance for specific hardware architectures.
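To give a feel for the kind of hardware-specific kernel that native code makes possible, here is a small C++ sketch using AVX2 intrinsics to sum an int64 column four lanes at a time. It is a generic illustration of intrinsics rather than anything from Photon, and it assumes an x86-64 CPU with AVX2 support and compilation with -mavx2.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>
#include <iostream>
#include <vector>

// Sum an int64 column using 256-bit AVX2 vectors: four values per add.
// The JVM offers no way to hand-write a kernel like this.
int64_t sum_avx2(const int64_t* data, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        acc = _mm256_add_epi64(acc, v);
    }
    alignas(32) int64_t lanes[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
    int64_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) sum += data[i];  // scalar tail for the leftover values
    return sum;
}

int main() {
    std::vector<int64_t> col(1000);
    for (size_t i = 0; i < col.size(); ++i) col[i] = static_cast<int64_t>(i);
    std::cout << sum_avx2(col.data(), col.size()) << "\n";  // 499500
}
```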
Difficulty in Parallelism and Thread Management
While the JVM has built-in support for multi-threading, managing fine-grained parallelism for large-scale distributed systems like Spark proved difficult. The JVM’s threading model is optimized for general-purpose use, but for highly parallel workloads, such as large-scale data processing, the lack of direct control over thread scheduling and task management meant the team could not fully optimize such workloads.
Thus, Databricks embarked on the journey of building a new, more efficient native query engine, capable of seamlessly integrating with the existing ecosystem while providing a performance leap.
Moving Beyond the JVM: A Shift to Native Execution
One of the pivotal decisions they made was to move away from the JVM-based engine and implement a new execution engine using native code. This was a significant departure from the existing Databricks Runtime, which had been JVM-based. The shift posed a considerable challenge, as it required integrating the new native engine seamlessly with the rest of the runtime.
This decision was rooted in the fact that workloads were becoming increasingly CPU-bound, and squeezing additional performance from the JVM-based engine was becoming more difficult. Several factors contributed to this shift:
Optimized I/O Performance: Techniques like local NVMe SSD caching and auto-optimized shuffle had already significantly reduced I/O latency, minimizing one of the main bottlenecks in these workloads.
Advanced Query Optimization: Delta Lake's data clustering features enabled more aggressive query optimization through file pruning, allowing queries to skip unnecessary data more efficiently. This further reduced I/O wait times, but the in-memory execution on the JVM began to struggle as the system became more I/O-optimized.
Complex Data Workloads: The growing need to process un-normalized data, large strings, and unstructured nested data types put more strain on in-memory performance, highlighting the limitations of the JVM-based execution engine.
As a result, the team was clearly hitting the performance ceiling of the JVM's in-memory execution. To push beyond it, a deep understanding of JVM internals and manual tuning of the JIT compiler (for optimal code generation, such as SIMD instructions in loops) was required. This level of optimization was challenging and often left to engineers with specialized knowledge of JVM internals. Even with these adjustments, lower-level performance features, like memory pipelining and custom SIMD kernels, remained out of reach within the JVM environment.
They also encountered performance issues in production, particularly with garbage collection. As stated in their paper, when heap sizes exceeded 64GB—relatively small for modern cloud instances—the performance of garbage collection dropped significantly, forcing them to rely on manually managed off-heap memory. While this approach helped, it added complexity and made the code harder to maintain compared to native code.
Additionally, Spark’s generated Java code frequently ran into JVM constraints on generated method size and code cache size, causing it to fall back to much slower interpreted code paths, especially for wide tables (e.g., those with hundreds of columns, which are common in Lakehouse architectures). This became a serious bottleneck in production.
After evaluating the engineering challenges and performance limitations within the JVM, the team concluded that a native query execution engine would allow them to break through these ceilings. This shift not only offered better performance but also greater control over optimizations, ultimately leading to a more scalable and maintainable solution for their workloads.
Interpreted Vectorization vs. Code Generation: Approaching High-Performance Query Engines
Modern Online Analytical Processing (OLAP) systems typically rely on two primary approaches for engine design: interpreted vectorized execution and code generation. Both methods have their advantages and challenges, and Databricks had to consider both for the Photon engine.
Interpreted Vectorization: Simplicity and Flexibility
In an interpreted vectorized design, the query engine uses dynamic dispatch mechanisms (such as virtual function calls) to select the code to execute for a given input. However, instead of processing individual rows, these systems process data in batches, which allows for amortizing overheads such as virtual function calls. This design also enables efficient use of SIMD vectorization, where operations are executed in parallel, leveraging the CPU’s full processing power and memory hierarchy.
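A minimal C++ sketch of this idea follows; the class and method names (VectorizedExpr, evalBatch) are hypothetical, not Photon's. The key point is that the virtual call happens once per column batch, so its cost is amortized over thousands of rows, and the inner loop stays tight enough for the compiler to vectorize.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

// A column batch: values for one column across a few thousand rows.
struct ColumnBatch {
    std::vector<int64_t> values;
};

// Operators are picked via dynamic dispatch (a virtual call), but the call
// happens once per *batch*, so its overhead is amortized across all rows.
struct VectorizedExpr {
    virtual ~VectorizedExpr() = default;
    virtual void evalBatch(const ColumnBatch& in, ColumnBatch& out) const = 0;
};

struct AddConst final : VectorizedExpr {
    int64_t c;
    explicit AddConst(int64_t c) : c(c) {}
    void evalBatch(const ColumnBatch& in, ColumnBatch& out) const override {
        out.values.resize(in.values.size());
        for (size_t i = 0; i < in.values.size(); ++i)  // tight, SIMD-friendly loop
            out.values[i] = in.values[i] + c;
    }
};

int main() {
    std::unique_ptr<VectorizedExpr> expr = std::make_unique<AddConst>(10);
    ColumnBatch in{{1, 2, 3, 4}}, out;
    expr->evalBatch(in, out);                          // one virtual call per batch
    for (int64_t v : out.values) std::cout << v << " ";
    std::cout << "\n";
}
```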
The MonetDB/X100 system and others like it have used this approach successfully. For the native engine prototype, both interpreted vectorization and code generation were considered. After careful evaluation, the interpreted vectorized model was selected for several reasons, both technical and practical.
Why Interpreted Vectorization Was Chosen
Easier to Develop and Scale
One of the initial challenges with code generation was its complexity. Dynamically producing the query execution code made debugging and development significantly harder. Unlike interpreted approaches, which can leverage existing tools like debuggers, code generation often requires manual instrumentation. This slowed down development and made it more error-prone.
Engineers spent months prototyping aggregation in a code-generating engine, but only a couple of weeks with the interpreted vectorized engine. The faster development with the interpreted approach was due to the ease of using existing tools and techniques like print debugging, which streamlined the process.
Better Observability
Code generation typically eliminates the overhead of interpretation by inlining operators into a small number of pipelined functions. While this improves performance, it comes at the cost of observability. In practice, when debugging a query, it becomes difficult to pinpoint how much time is spent within each operator, as they are fused into a single processing loop.
In contrast, the interpreted approach maintains clear abstraction boundaries between operators, which allows for better tracking of performance metrics. Each operator can maintain its own set of metrics, making it easier to identify bottlenecks and areas for optimization. This is especially critical once the engine is deployed in production environments, where performance issues may arise from customer workloads that cannot be easily reproduced by the developers.
Adaptability to Changing Data
The flexibility of the interpreted vectorized model is particularly valuable in dynamic environments like Lakehouse architectures, where traditional data constraints may not be available for all types of queries. The dynamic dispatch mechanism inherent in the interpreted model allows the engine to choose code paths at runtime, adapting to batch-level properties and varying query patterns.
Achieving similar adaptability in a code-generation approach would have been more difficult. It would have required compiling numerous branches at runtime or dynamically recompiling parts of the query, adding significant overhead to query execution time and memory usage. Even leading code-generating systems like HyPer use interpreters to circumvent these costs in certain scenarios.
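As a rough illustration of this batch-level adaptivity (again with hypothetical types, not Photon's), an operator can check a per-batch property such as "does this batch contain NULLs?" at runtime and pick a branch-free fast path when it does not:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// A batch that tracks whether any of its values is NULL.
struct Int64Batch {
    std::vector<int64_t> values;
    std::vector<bool>    is_null;    // consulted only when has_nulls is true
    bool                 has_nulls = false;
};

// The engine inspects batch-level properties at runtime and chooses a code
// path: a branch-free fast path for NULL-free batches, a null-aware path
// otherwise.
int64_t sum_adaptive(const Int64Batch& b) {
    int64_t sum = 0;
    if (!b.has_nulls) {
        for (int64_t v : b.values) sum += v;              // fast path
    } else {
        for (size_t i = 0; i < b.values.size(); ++i)
            if (!b.is_null[i]) sum += b.values[i];        // null-aware path
    }
    return sum;
}

int main() {
    Int64Batch dense{{1, 2, 3}, {}, false};
    Int64Batch sparse{{1, 2, 3}, {false, true, false}, true};
    std::cout << sum_adaptive(dense) << " " << sum_adaptive(sparse) << "\n";  // 6 4
}
```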
Specialization Without Complexity Overload
While code generation offers clear performance advantages in some scenarios—such as simplifying complex expression trees using compiler optimizations like common sub-expression elimination and pruning unused columns via dead-store elimination—similar performance gains could be achieved through specialization in the interpreted vectorized approach.
For example, one of the most common expressions in customer queries is the "between" operation (e.g., col >= left AND col <= right). Evaluated naively in an interpreted engine, as two comparisons combined with an AND, it incurs noticeable interpretation overhead. By creating a specialized fused operator for the "between" expression, that overhead was significantly reduced without resorting to complex code generation. This approach maintained performance while keeping the system simpler and easier to scale.
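A sketch of what such a fused kernel looks like, contrasted with composing generic comparison and AND kernels; the function names are illustrative only, not Photon's:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Generic composition: three separate passes (>=, <=, AND), each producing an
// intermediate boolean vector that the next kernel has to read back.
std::vector<uint8_t> between_composed(const std::vector<int64_t>& col,
                                      int64_t lo, int64_t hi) {
    std::vector<uint8_t> ge(col.size()), le(col.size()), out(col.size());
    for (size_t i = 0; i < col.size(); ++i) ge[i] = col[i] >= lo;
    for (size_t i = 0; i < col.size(); ++i) le[i] = col[i] <= hi;
    for (size_t i = 0; i < col.size(); ++i) out[i] = ge[i] & le[i];
    return out;
}

// Fused kernel: one pass, no intermediate vectors, far less per-expression
// interpretation overhead.
std::vector<uint8_t> between_fused(const std::vector<int64_t>& col,
                                   int64_t lo, int64_t hi) {
    std::vector<uint8_t> out(col.size());
    for (size_t i = 0; i < col.size(); ++i)
        out[i] = (col[i] >= lo) & (col[i] <= hi);
    return out;
}

int main() {
    std::vector<int64_t> col = {1, 5, 10, 15};
    auto a = between_composed(col, 4, 12);
    auto b = between_fused(col, 4, 12);
    for (size_t i = 0; i < col.size(); ++i)
        std::cout << int(a[i]) << int(b[i]) << " ";  // 00 11 11 00
    std::cout << "\n";
}
```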
Databricks’ engineers recognized the merits of both approaches. However, they ultimately chose the vectorized approach for Photon, primarily due to its more manageable debugging process and flexibility to adapt to different data types. The flexibility offered by vectorized processing is crucial for the Lakehouse, where data statistics may not always be available and workloads can vary drastically.
Why Photon Chooses Vectorized Execution
Photon’s decision to implement a vectorized execution engine was driven by several key considerations:
Adaptability to Data Properties: Photon can dynamically choose code paths based on the type of input data, allowing it to perform optimally even in cases where constraints or statistics are unavailable. This adaptability is a critical feature for Lakehouse workloads, which often involve diverse data sources and formats.
Performance Optimization: Vectorized engines allow for more efficient pipelining and better utilization of CPU caches. By processing data in contiguous memory blocks (columnar format), Photon achieves better SIMD usage, making it suitable for the kinds of large, diverse datasets that Lakehouse workloads generate.
Maintaining Flexibility and Debugging Ease: Vectorized execution maintains a clear distinction between operators, which is essential for monitoring performance metrics. In contrast to code generation, where operator collapsing can obscure performance data, the vectorized approach enables better visibility into execution and easier debugging.
Row vs. Column-Oriented Execution: The Shift to Columnar Representation
One of the key changes Photon made was shifting from a row-oriented data model to a columnar data model. In traditional systems like Spark SQL, data was represented in memory in a row-oriented format, meaning that records were stored in memory row by row. This model works well for transactional systems but is inefficient for analytics, particularly when dealing with large-scale datasets in a data lake.
In contrast, Photon adopts a columnar in-memory representation, which is ideal for modern OLAP workloads. This change has several advantages (a small layout sketch follows the list):
Efficiency in Data Scanning: Columnar storage allows for more efficient scanning of data, especially when only a subset of columns is needed for a query. This is particularly beneficial for analytical workloads, where queries often require aggregations and filtering on specific columns.
Optimized for SIMD: Columnar data layout is more suitable for SIMD operations. Since data for each column is stored contiguously in memory, operations on columns can be vectorized, dramatically improving performance.
Efficient Disk I/O: Since data is stored in a columnar format on disk (e.g., in Parquet files), Photon eliminates the need for column-to-row pivoting when reading data from disk. This reduces I/O overhead and improves the overall efficiency of data processing.
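The difference between the two layouts can be sketched in a few lines of C++. This is a generic illustration of row-oriented versus columnar storage, not Photon's internal data structures:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Row-oriented: each record is stored together; scanning one field drags the
// whole row (and its padding) through the cache.
struct SaleRow {
    int64_t     order_id;
    double      amount;
    std::string region;
};

// Column-oriented: each field is stored contiguously; a scan over `amount`
// touches only that array, which is cache-friendly and SIMD-friendly.
struct SalesColumns {
    std::vector<int64_t>     order_id;
    std::vector<double>      amount;
    std::vector<std::string> region;
};

double total_amount_rows(const std::vector<SaleRow>& rows) {
    double total = 0.0;
    for (const auto& r : rows) total += r.amount;   // strided memory access
    return total;
}

double total_amount_columns(const SalesColumns& cols) {
    double total = 0.0;
    for (double a : cols.amount) total += a;        // contiguous memory access
    return total;
}

int main() {
    std::vector<SaleRow> rows = {{1, 19.99, "EU"}, {2, 5.00, "US"}, {3, 42.50, "EU"}};
    SalesColumns cols{{1, 2, 3}, {19.99, 5.00, 42.50}, {"EU", "US", "EU"}};
    std::cout << total_amount_rows(rows) << " " << total_amount_columns(cols) << "\n";
}
```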
Ensuring Semantics Consistency between Photon and Spark
Ensuring that Photon’s behavior remains consistent with Apache Spark’s is a critical challenge in integrating the two systems. Since the same query expression can run in either Photon or Spark depending on the compatibility of various parts of the query, the results must be identical to maintain the integrity of the system. This consistency is especially important because differences in how certain operations are executed in Java (Spark) versus C++ (Photon) can lead to discrepancies in the output, even for seemingly simple operations.
For example, integer-to-floating point conversions are implemented differently in Java and C++, which may lead to slightly different results in some edge cases. Additionally, time-based expressions that rely on the IANA Timezone Database can also behave inconsistently across the two engines. The version of the database shipped with the JVM may differ from the one used by Photon, potentially leading to mismatched time zone calculations.
To address these issues, a comprehensive testing strategy is employed to ensure consistency. This includes:
Unit Tests
Unit tests explicitly target SQL expressions to ensure that functions like upper(), sqrt(), and other transformations behave identically in both Spark and Photon. A custom framework in native code tests SQL expressions by loading input values into column vectors and comparing the output of the expressions in both engines. This framework allows for testing various scenarios, such as those involving NULL values or inactive rows.
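A toy version of such an expression test might look like the following C++ sketch, where upper_kernel and the nullable column-vector type are hypothetical stand-ins for the real native framework:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>
#include <vector>

// A nullable string column vector and the engine's upper() kernel under test.
using StringVector = std::vector<std::optional<std::string>>;

StringVector upper_kernel(const StringVector& in) {
    StringVector out;
    out.reserve(in.size());
    for (const auto& v : in) {
        if (!v) { out.push_back(std::nullopt); continue; }  // NULL in, NULL out
        std::string s = *v;
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
        out.push_back(s);
    }
    return out;
}

int main() {
    // Load inputs (including NULLs) into a column vector, run the kernel, and
    // compare against the results the reference engine is expected to produce.
    StringVector input    = {"photon", std::nullopt, "Spark"};
    StringVector expected = {"PHOTON", std::nullopt, "SPARK"};
    assert(upper_kernel(input) == expected);
    return 0;
}
```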
In addition, Photon integrates with Spark’s open-source expression unit tests. These tests are executed on both engines, and any discrepancies are flagged. The unit tests contributed by the community and Databricks ensure broad coverage for common SQL expressions.
End-to-End Tests
End-to-end tests run full SQL queries against both Photon and Spark to verify that the results are consistent. These tests are designed to cover a wide range of query types and edge cases. A dedicated set of tests is run exclusively when Photon is enabled, ensuring that the integration of Photon-specific plan transformations does not introduce errors, such as memory corruption or inconsistent data. Additionally, enabling Photon in the full suite of Spark SQL tests provides further assurance of compatibility and highlights any potential discrepancies.
Fuzz Tests
Fuzz testing plays a crucial role in detecting subtle bugs and ensuring compatibility between Photon and Spark. Random queries are generated and executed against both systems, comparing the results to catch inconsistencies. Specialized fuzz testers are employed for specific features, such as the handling of decimal types. Photon’s decimal implementation differs from Spark’s in that it allows operations over inputs of different types for performance reasons, whereas Spark typically casts all decimal inputs to the output type before performing operations. This difference can lead to performance improvements but also introduces slight behavioral differences, which are tracked and tested using a behavior whitelist to ensure that any legitimate deviations do not lead to incorrect results.
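Differential fuzz testing itself is straightforward to sketch: generate random inputs, evaluate the same expression with two implementations, and flag any disagreement. The C++ example below is a generic illustration rather than Photon's fuzz tester; it pairs integer division with an arithmetic right shift, which disagree on negative odd inputs, exactly the kind of subtle semantic difference such testing is designed to surface:

```cpp
#include <cstdint>
#include <iostream>
#include <random>

// Two implementations of "divide by two": truncating integer division vs. an
// arithmetic right shift. They agree on positive inputs but round differently
// for negative odd ones.
int64_t reference_eval(int64_t x) { return x / 2; }
int64_t optimized_eval(int64_t x) { return x >> 1; }  // arithmetic shift (well-defined since C++20)

int main() {
    std::mt19937_64 rng(42);  // fixed seed so a run is reproducible
    std::uniform_int_distribution<int64_t> dist(-1'000'000, 1'000'000);
    for (int i = 0; i < 100'000; ++i) {
        int64_t x = dist(rng);
        int64_t a = reference_eval(x);
        int64_t b = optimized_eval(x);
        if (a != b) {  // every mismatch is either a bug or a deviation to whitelist
            std::cout << "Mismatch for " << x << ": " << a << " vs " << b << "\n";
            return 1;
        }
    }
    std::cout << "All sampled inputs agreed.\n";
    return 0;
}
```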
Your Turn: What’s Your Experience with Databricks’ Spark?
As data engineers, architects, and practitioners, we are constantly evaluating and choosing tools and systems to meet our performance and scalability needs. With Photon’s rapid advancements, it’s an exciting time for the data engineering community. But what does the future hold for Apache Spark within the open-source ecosystem? Will Databricks maintain its competitive edge, or will the open-source community eventually catch up?
I’d love to hear your thoughts in the comments—let’s spark a discussion on what the future might look like for Spark, Photon, and the broader data engineering landscape.
Thanks for reading my friends! If you have any feedback or suggestions, feel free to reach out. Stay tuned for more insights into cutting-edge data engineering topics!