Solving query optimization in Presto

“SQL on everything” is the tagline associated with Presto, the query engine that was initially developed by Facebook to rapidly analyze massive amounts of data — particularly data that lay scattered across multiple formats and sources. Since its release as an open source project in 2013, Presto has been adopted broadly across hundreds of enterprises. Today, a strong worldwide community contributes to its ongoing development.

A decade or so previously, the traditional approach for a company to handle its data processing needs was to set up a data center, stock it with CPUs and hard drives, and acquire all of the relevant software to tame, store, and analyze the data. This also required an investment in several software licenses and associated service contracts. These data services tended to be used in bursts — i.e., the beginning of the week and end of the quarter handled a lot more traffic than other times. But since these resources were statically allocated, they had to be provisioned for peak usage and left under-utilized the rest of the time. Additionally, companies would need to staff a team of engineers to keep this setup operational, ensure high availability, and troubleshoot various use cases.

Elastic cloud economics is the tectonic shift in this industry that now allows enterprises to pay only for the resources they use. They can tap low-cost data storage services provided in the cloud, such as Amazon S3, and dynamically provision data processing workhorses in the form of virtual servers that closely match the size of the varying workload.

This decoupling of storage and compute allows users to seamlessly resize their compute resources. Query engines like Presto work well in this auto-scaling context, and they are seeing increased adoption as more enterprises move data to the cloud. Presto has an extensible, federated design that allows it to read and process data seamlessly from disparate data sources and file formats.

While Presto’s federated architecture is quite beneficial in being able to process data in place, it engenders significant complexity in generating an optimal execution plan for a query. The rest of this article will explain why generating an optimal query execution plan is a hard problem for Presto and express a view on the way forward.

The evolution of the query optimizer 

First, let me take a step back and describe the generic problem and some of the solutions that have been developed over the past several decades. Query optimizers are responsible for converting SQL, expressed declaratively, to an efficient sequence of operations that may be performed by the engine on the underlying data. As such, query optimizers are a critical component of databases.  

Source link