Why you should use Presto for ad hoc analytics

Presto! It’s not only an incantation to excite your audience after a magic trick, but also a name being used more and more when discussing how to churn through big data. While there are many deployments of Presto in the wild, the technology — a distributed SQL query engine that supports all kinds of data sources — remains unfamiliar to many developers and data analysts who could benefit from using it.

In this article, I’ll be discussing Presto: what it is, where it came from, how it is different from other data warehousing solutions, and why you should consider it for your big data solutions.

Presto vs. Hive

Presto originated at Facebook back in 2012. Open-sourced in 2013 and managed by the Presto Foundation (part of the Linux Foundation), Presto has experienced a steady rise in popularity over the years. Today, several companies have built a business model around Presto, such as Ahana, with PrestoDB-based ad hoc analytics offerings.

Presto was built as a means to provide end users access to enormous data sets for ad hoc analysis. Before Presto, Facebook used Hive (also built by Facebook and then donated to the Apache Software Foundation) to perform this kind of analysis. As Facebook’s data sets grew, Hive proved insufficiently interactive (read: too slow). This was largely because the foundation of Hive is MapReduce, which, at the time, required intermediate data sets to be persisted to HDFS. That meant a lot of I/O to disk for data that was ultimately thrown away.

Presto takes a different approach to executing those queries to save time. Instead of persisting intermediate data sets to HDFS, Presto pulls the data into memory and performs operations on it there. If that sounds familiar, you may have heard of Apache Spark (or any number of other technologies) built on the same basic concept to effectively replace MapReduce-based approaches. With Presto, the data stays where it lives (in Hadoop or, as we’ll see, anywhere), and execution happens in memory across the distributed system, shuffling data between servers as needed. Avoiding disk for intermediate results ultimately speeds up query execution.

How Presto works

Unlike a traditional data warehouse, Presto is a SQL query execution engine, not a storage system. A data warehouse controls how data is written, where that data resides, and how it is read; once you get data into your warehouse, it can prove difficult to get it back out. Presto takes another approach, decoupling data storage from processing while supporting the same ANSI SQL query language you are used to.
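As a rough sketch of what that decoupling looks like in practice: because each data source is exposed as a catalog, a single query can reference tables in different systems by their fully qualified names. The catalog, schema, and table names below are hypothetical, assuming a cluster configured with a Hive connector and a MySQL connector:

```sql
-- Hypothetical catalogs: "hive" (HDFS data) and "mysql" (an operational database).
-- Presto reads each source through its connector and performs the join in memory,
-- without first copying either data set into a warehouse.
SELECT c.region, count(*) AS views
FROM hive.web.page_views v
JOIN mysql.crm.customers c
  ON v.customer_id = c.id
WHERE v.view_date >= DATE '2020-01-01'
GROUP BY c.region
ORDER BY views DESC;
```

The three-part names (catalog.schema.table) are how Presto addresses external storage; adding a new source is a matter of configuring another catalog, not migrating data.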

Copyright © 2020 IDG Communications, Inc.
