Solving query optimization in Presto

“SQL on everything” is the tagline associated with Presto, the query engine that was initially developed by Facebook to rapidly analyze massive amounts of data — particularly data that lay scattered across multiple formats and sources. Since its release as an open source project in 2013, Presto has been adopted broadly across hundreds of enterprises. Today, a strong worldwide community contributes to its ongoing development.

A decade or so ago, the traditional approach for a company to handle its data processing needs was to set up a data center, stock it with CPUs and hard drives, and acquire all of the relevant software to tame, store, and analyze the data. This also required an investment in software licenses and the associated service contracts. These data services tended to be used in bursts: the beginning of the week and the end of the quarter handled far more traffic than other times. But since these resources were statically allocated, they had to be provisioned for peak usage and were left under-utilized the rest of the time. Additionally, companies needed to staff a team of engineers to keep this setup operational, ensure high availability, and troubleshoot various use cases.

Elastic cloud economics is the tectonic shift in this industry that now allows enterprises to pay only for the resources they use. They can tap low-cost data storage services provided in the cloud, such as Amazon S3, and dynamically provision data processing workhorses in the form of virtual servers that closely match the size of the varying workload.

This decoupling of storage and compute allows users to seamlessly resize their compute resources. Query engines like Presto work well in this auto-scaling context, and they are seeing increased adoption as more enterprises move data to the cloud. Presto has an extensible, federated design that allows it to read and process data seamlessly from disparate data sources and file formats.

While Presto’s federated architecture is quite beneficial in being able to process data in place, it engenders significant complexity in generating an optimal execution plan for a query. The rest of this article will explain why generating an optimal query execution plan is a hard problem for Presto and express a view on the way forward.

The evolution of the query optimizer 

First, let me take a step back and describe the generic problem and some of the solutions that have been developed over the past several decades. Query optimizers are responsible for converting SQL, expressed declaratively, to an efficient sequence of operations that may be performed by the engine on the underlying data. As such, query optimizers are a critical component of databases.  

The input to a query optimizer is a “logical plan,” which itself is the result of parsing the input SQL and converting it to a high-level collection of the operations required to execute the query. The optimizer then works on converting the logical plan into an efficient execution strategy depending on the operators available to the query engine and the characteristics of the data layout.

For a typical SQL query, there exists one logical plan but many strategies for implementing and executing that logical plan to produce the desired results. The optimizer is responsible for choosing the “best” execution plan among these candidates. In fact, the pool of potential candidates (the search space) that the optimizer must sift through grows so quickly with the number of joins that finding the truly optimal plan is an NP-hard problem: no known algorithm can guarantee the optimal choice in a reasonable time frame for anything but small queries.
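To get a feel for how quickly the search space explodes, consider just the join-order component: with n tables there are n! left-deep join orders and n! multiplied by the (n-1)th Catalan number distinct bushy join trees. A quick illustrative sketch in Python (the function names are my own, not optimizer terminology):

```python
from math import comb, factorial

def left_deep_orders(n: int) -> int:
    """Number of left-deep join orderings of n tables: n!."""
    return factorial(n)

def bushy_trees(n: int) -> int:
    """Number of ordered bushy join trees over n tables: n! * Catalan(n-1)."""
    catalan = comb(2 * (n - 1), n - 1) // n  # Catalan(n - 1)
    return factorial(n) * catalan

# Even modest queries explode: 3 tables yield 12 trees; 10 tables yield
# tens of billions, before even considering join algorithms or data layout.
for n in (3, 5, 10):
    print(n, left_deep_orders(n), bushy_trees(n))
```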

Most query optimizers use a set of heuristic rules to curtail the search space. The goals include minimizing the data read from disk (predicate and limit pushdown, column pruning, etc.) and reducing network transfer of data (reshuffles), with the ultimate aim of faster query execution and lower resource usage. In its early versions, Presto’s query optimizer was a set of rules that operated on, and mutated, the logical plan until a fixed point was reached.
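A minimal sketch of such a fixed-point rule engine follows. The tuple-based plan representation and the filter-merging rule are hypothetical illustrations, not Presto's actual internals:

```python
def merge_filters(plan):
    """Toy rewrite rule: collapse Filter(p1, Filter(p2, x)) into one filter."""
    if isinstance(plan, tuple) and plan[0] == "Filter":
        child = merge_filters(plan[2])
        if isinstance(child, tuple) and child[0] == "Filter":
            return ("Filter", f"{plan[1]} AND {child[1]}", child[2])
        return ("Filter", plan[1], child)
    return plan

def optimize(plan, rules):
    """Apply rules repeatedly until no rule changes the plan (a fixed point)."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_plan = rule(plan)
            if new_plan != plan:
                plan, changed = new_plan, True
    return plan

plan = ("Filter", "l_shipdate > DATE '1995-03-01'",
        ("Filter", "o_orderdate < DATE '1995-03-01'", ("Scan", "t")))
optimized = optimize(plan, [merge_filters])
# The two stacked filters collapse into a single conjunctive filter.
```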

Let us try to understand what this looks like using a concrete example. The example below, picked from a suite of ad hoc queries, is part of the commonly used decision support benchmark, TPC-H. TPC-H Q3, the Shipping Priority Query, is a query that retrieves the 10 unshipped orders with the highest value.

SELECT
    l_orderkey,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'AUTOMOBILE'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-01'
    AND l_shipdate > DATE '1995-03-01'
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue DESC,
    o_orderdate;

This query performs a three-way join between the data tables customer, orders, and lineitem (join keys custkey and orderkey) and narrows the results set by applying a set of filters (c_mktsegment, o_orderdate, l_shipdate). The query then calculates an aggregate SUM by grouping on each distinct combination of l_orderkey, o_orderdate, and o_shippriority and orders the result set by descending order of the computed column (revenue) and orderdate.

The naive approach

The naive approach to optimizing this query would perform a full cross join of the three tables (a Cartesian product), eliminate from this set all the tuples that do not satisfy the filters in the WHERE clause, then perform the aggregation by identifying each unique combination of l_orderkey, o_orderdate, and o_shippriority and calculating SUM(l_extendedprice * (1 - l_discount)), and finally order the result set by revenue (descending) and o_orderdate. This sequence of operations, while guaranteed to produce accurate results, will not work for even a moderately sized dataset on most hardware. The Cartesian product would produce a huge intermediate result set that exceeds the main memory limits of most servers. It is also inefficient to read all the data from disk for all three tables when the query is only interested in the specific tuples that satisfy the predicates.
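The arithmetic makes the problem concrete. Using the TPC-H scale factor 1 row counts (customer: 150,000; orders: 1,500,000; lineitem: roughly 6,000,000, rounded here for illustration):

```python
# TPC-H scale factor 1 row counts (lineitem rounded for illustration)
customer_rows = 150_000
orders_rows = 1_500_000
lineitem_rows = 6_000_000

# The naive plan materializes the full Cartesian product before filtering,
# yielding an intermediate result on the order of 10**18 tuples.
cartesian_tuples = customer_rows * orders_rows * lineitem_rows
print(f"{cartesian_tuples:.2e} intermediate tuples")
```

No server holds 10^18 tuples in memory, which is why every practical optimizer pushes filters and joins below any such product.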

The rule-based optimizer (RBO)

This framework mitigates some of the problems in the naive approach. To illustrate, it can generate a plan in which the predicates are applied while the data is read for each table. Therefore, while materializing tuples for table customer, only the records that match c_mktsegment = 'AUTOMOBILE' would be realized. Likewise only records satisfying o_orderdate < DATE '1995-03-01' for table orders and records satisfying l_shipdate > DATE '1995-03-01' for table lineitem would be read from disk. This already reduces the size of the intermediate result set by several orders of magnitude.

The RBO would also never suggest a Cartesian product of all three tables for the intermediate result in this case. It would instead first perform a join between two tables, e.g. customer and orders, and only retain the tuples that match the predicate c_custkey = o_custkey, and then perform another join between this intermediate result set and the lineitem table.

There are two advantages to the RBO methodology. The first is the greatly reduced memory required to compute the join, since filters are applied aggressively to prune out tuples that are not of interest. The second is that it enables efficient join algorithms, such as the commonly used hash join, in which a hash table is built from the join keys of the smaller table.

For example, while joining customer with orders, a hash table is built on customer.c_custkey, and then for the records in orders, only the records where orders.o_custkey exists in the hash table are read. The hash table is built from the smaller input to the join because this has a higher chance of fitting in memory, and materializes only the tuples that are necessary for each join. (Pushing the aggregation below the join is another advanced optimization technique, but is beyond the scope of this article.)
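A minimal sketch of the build and probe phases of a hash join, with rows modeled as Python dicts, follows. This is illustrative only, not Presto's implementation:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Hash join sketch: build a table on the smaller input, probe with
    the larger one, yielding merged rows for each key match."""
    table = {}
    for row in build_rows:                       # build phase
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                       # probe phase
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

# Toy data: customer is the smaller (build) side.
customers = [{"c_custkey": 1, "c_mktsegment": "AUTOMOBILE"},
             {"c_custkey": 2, "c_mktsegment": "BUILDING"}]
orders = [{"o_custkey": 1, "o_orderkey": 10},
          {"o_custkey": 1, "o_orderkey": 11},
          {"o_custkey": 3, "o_orderkey": 12}]

joined = list(hash_join(customers, orders, "c_custkey", "o_custkey"))
# Orders 10 and 11 match customer 1; order 12 finds no match and is dropped.
```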

The cost-based optimizer (CBO)

The next step in the evolution of query optimizers was the advent of cost-based optimization. If one knew some characteristics of the data in the tables, such as minimum and maximum values of the columns, number of distinct values in the columns, number of nulls, histograms depicting distribution of column data, etc., these could have a major impact in some of the choices the optimizer would make. Let us walk through this with the TPC-H Q3 benchmark query discussed above.

The CBO continues from the fixed point reached by the RBO. To improve on the RBO’s plan, it helps to know the size of the inputs to each join in order to decide which input should be used to build the hash table. For a join between two tables purely on their join keys, with no additional predicates, the optimizer typically knows the row counts of the tables and can choose the build side appropriately. In most cases, however, there are additional predicates that determine the size of the join inputs.

For example, in the Q3 query there is customer.c_mktsegment = 'AUTOMOBILE' on customer and orders.o_orderdate < DATE '1995-03-01' on orders. The CBO relies on the engine to provide it with the selectivities of these predicates on the tables, and then uses these to estimate the size of the join inputs. Therefore, even though the customer table may be smaller than the orders table, once the filters are applied, the number of records flowing into the join from the orders table may actually be smaller. A CBO can also propagate the estimated cardinality of operations such as joins or aggregations through the plan and use these estimates to make intelligent choices in other parts of the query plan.
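The core calculation can be sketched as follows. The selectivities below are hypothetical placeholders chosen to make the point, not values from actual TPC-H statistics:

```python
def estimate_rows(base_rows: int, selectivity: float) -> float:
    """Estimate rows surviving a filter, given a predicate selectivity."""
    return base_rows * selectivity

# Hypothetical statistics: c_mktsegment has 5 distinct values, so the
# segment filter keeps ~1/5 of customer; assume a narrow date filter
# keeps only ~1% of orders.
customer_estimate = estimate_rows(150_000, 1 / 5)   # ~30,000 rows
orders_estimate = estimate_rows(1_500_000, 0.01)    # ~15,000 rows

# Build the hash table on whichever filtered input is estimated smaller:
# here the larger base table ends up being the smaller join input.
build_side = "orders" if orders_estimate < customer_estimate else "customer"
```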

Query optimization in Presto vs. traditional databases

Cost-based optimization is not easy to realize in Presto. Traditional databases do not decouple storage and compute, and typically do not interface with any sources of data other than those which have been ingested. As such, these databases can analyze and store all the relevant statistics about their datasets. In contrast, Presto operates on data lakes where the data could be ingested from disparate processes and the scale of the data is several orders of magnitude larger.

Keeping data lake statistics accurate and updated requires a huge commitment of resources and is very hard to maintain — as many enterprises have discovered. Furthermore, as a data federation engine, Presto interfaces, oftentimes simultaneously, with multiple datastores that have very different characteristics. These data sources could range from a straightforward filesystem (HDFS) to continuous streaming data systems (Pinot, Druid) to mature database systems (Postgres, MySQL). And more of these connectors are being added frequently.

To create a unified cost model that covers all these data sources and enables the optimizer to quantitatively reason about tradeoffs in access costs across these sources is an incredibly hard problem. The lack of reliable statistics and the lack of a unified cost model have caused major enterprises to completely ignore cost-based optimization in Presto, even though the community has developed limited support for some CBO features like join reordering.

A smarter query optimization plan for Presto

In my opinion, instead of expending effort to recreate what worked for traditional RDBMSs, namely painstakingly constructing a federated cost model for Presto, it would be more beneficial to explore alternative approaches to the problem. Old-school approaches to query optimization have typically considered the problem in a static context: the optimizer works independently to produce an optimal plan that is then handed off to the scheduler or execution engine of the DBMS.

Adaptive query execution is a paradigm that removes the architectural distinction between query planning and query execution. It is a framework in which the execution plan is adaptive: It can monitor the data being processed and can change shape depending on the characteristics of the data flowing through the operators in the plan.

Adaptive query execution typically takes as input a query plan that is produced as the result of heuristic/rule-based optimization, and then reorders operators or replans subqueries based on run-time performance. Such an approach would partition a query plan into subplans that may be independently altered. During query execution, the system monitors and detects a suboptimal subplan of a query based on key performance indicators and dynamically replans that fragment. The engine could also potentially reuse the partial results that had been produced by the old subplan or operators and avoid redoing the work. These suboptimal plan fragments may be reoptimized several times throughout query execution, with a careful tradeoff between opportunistic re-optimization and the risk of producing an even less optimal plan.
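One common trigger for such re-optimization is a large gap between estimated and observed cardinalities. A sketch of that check follows; the 10x threshold is an arbitrary illustrative choice, not a value from any engine:

```python
def should_replan(estimated_rows: int, actual_rows: int,
                  threshold: float = 10.0) -> bool:
    """Flag a subplan for re-optimization when the observed cardinality
    deviates from the estimate by more than `threshold`x either way."""
    ratio = max(actual_rows, 1) / max(estimated_rows, 1)
    return ratio > threshold or ratio < 1 / threshold

# An estimate of 1,000 rows that turns out to be 50,000 warrants a replan;
# a mild miss (1,200 actual) does not, avoiding replanning churn.
```

The guard against over-eager replanning matters: each re-optimization risks discarding useful partial results, which is exactly the tradeoff described above.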
