PrestoDB is a fast analytic SQL engine for querying big data of any size. Presto was developed at Facebook in 2012 to run interactive queries against its Hadoop/HDFS clusters, and the project was later released as open source under the Apache license. Before Presto, Facebook had also built the Hive query engine, but Hive was designed for batch processing and was not optimized for fast interactive queries.
The Presto query engine can run on top of many relational and non-relational sources, such as HDFS, Cassandra, MongoDB, and many more. The image alongside shows the sources on top of which the Presto query engine can run.
Many renowned companies use PrestoDB in their production environments to analyze big data, including Facebook, Airbnb, Netflix, Nasdaq, and Atlassian. Facebook runs over 30,000 queries daily, processing petabytes of data, and Netflix runs around 3,500 queries per day.
Although Facebook introduced Presto after Hive, it is not a replacement for Hive; the two have different use cases.
| Presto | Hive |
| --- | --- |
| Designed for short interactive queries. | Designed for batch processing. |
| 10–30x faster. | Lower performance. |
| In-memory architecture: keeps data in memory and runs no MapReduce jobs. | Runs MapReduce jobs in the background. |
| Not suitable for very large workloads because of in-memory processing. | Suitable for large workloads and long-running transformations. |
Presto is a distributed parallel processing framework with a single coordinator and multiple workers. A Presto client submits a SQL query to the coordinator, which parses and analyzes it and then schedules it across the workers. Presto supports many connectors, such as HBase, Hive, MongoDB, and Cassandra, which supply the metadata needed to plan queries; you can also create your own custom connector. The worker nodes fetch the actual data from the connectors, execute their portion of the query, and finally deliver the results to the client.
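You can see how the coordinator breaks a query into fragments scheduled on workers using Presto's `EXPLAIN` statement; the table name below is hypothetical and used only for illustration:

```sql
-- Show the distributed plan: how Presto splits the query into
-- fragments that the coordinator schedules across workers.
-- "employee" is a hypothetical table used for illustration.
EXPLAIN (TYPE DISTRIBUTED)
SELECT designation, count(*) AS headcount
FROM employee
GROUP BY designation;
```

The output lists plan fragments and the exchanges between them, which correspond to the work distributed among the worker nodes.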
Presto supports complex queries, joins, window functions, and aggregations. It executes queries in memory without first copying the data out of its original source, avoiding unnecessary I/O and contributing to faster execution.
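As a sketch of these capabilities, the query below combines an aggregation with a window function; the `employee` table and its columns are assumptions for illustration:

```sql
-- Average age per designation, plus each employee's rank by age
-- within their designation (table and columns are illustrative).
SELECT
    name,
    designation,
    age,
    avg(age) OVER (PARTITION BY designation)              AS avg_age_in_role,
    rank()   OVER (PARTITION BY designation ORDER BY age) AS age_rank_in_role
FROM employee;
```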
AWS is an ideal choice for setting up a Presto cluster because of its high availability, scalability, reliability, and cost effectiveness, and you can launch Presto clusters on the Amazon cloud in minutes. Amazon EMR and Amazon Athena are the easiest ways to deploy Presto on AWS. Amazon Athena runs Presto on AWS's serverless platform, so it requires no node provisioning, cluster tuning, or configuration. With Amazon Athena, you simply point to your data in Amazon S3, define its schema, and start running analytics on top of it. Moreover, you pay only for what you use, which means you are charged only for the queries you run.
Log in to your AWS account with your AWS credentials, and from the Services tab select Athena under the Analytics section; this takes you to the Athena console.
The Athena console lets you select data in Amazon S3 for analysis, and it can read data in many formats, including CSV, TSV, JSON, Parquet, and ORC. You can then define a schema over that data and query it with SQL/Hive queries to extract insights. Click the Get started button to move further.
The Amazon Athena editor can run interactive queries (SQL / Hive DDL) on data stored in Amazon S3 without any clusters or data warehouses, but before running any query we need to set up an Amazon S3 location to store the results and metadata of each query. Click the "set up a query result location in Amazon S3" link.
After you click the link, a settings window appears where you provide the S3 bucket location in which to store your query results; you can also choose to encrypt them. I created a bucket named "athena-query-res" with a folder named "athena" under it to save my query results. Click Save to continue.
I uploaded employee.txt, which contains the data shown above (name, designation, and age), to my S3 bucket for analysis. We will create a table on top of this data using the Athena editor.
The next step is to create a database and a table to represent the data. In the snapshot above, I created a database called "sampledb" and then a table called "employee" to represent the data I uploaded to S3 in the earlier step. I used Hive DDL to create an external table pointing at the S3 location, and finally selected all the data to view it in the editor.
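The DDL for these steps looks roughly like the following; the S3 data path, the comma delimiter, and the column types are assumptions based on the sample data described above:

```sql
-- Create a database for the sample data.
CREATE DATABASE IF NOT EXISTS sampledb;

-- External table over the employee.txt file in S3.
-- The S3 path and the comma delimiter are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.employee (
    name        STRING,
    designation STRING,
    age         INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-data-bucket/employee/';

-- Preview the data in the editor.
SELECT * FROM sampledb.employee;
```

Because the table is external, dropping it removes only the metadata; the underlying file in S3 is left untouched.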
Since the data and schema are in place, I can now run any query against the data using either SQL or HQL. In the example above, I ran a query to find all employees whose designation is manager. Similarly, you can perform aggregations, joins, and window operations on this data.
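That query would look something like this; the exact designation value is an assumption based on the example described:

```sql
-- All employees whose designation is "manager"
-- (table, columns, and the literal value are illustrative).
SELECT name, age
FROM sampledb.employee
WHERE designation = 'manager';
```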
You can also view previously run queries and their status on the History tab: error details, query run time, whether the query failed or succeeded, and the submission time; you can also download query results. Although we used a very small dataset for querying here, Presto can be used on petabytes of data as well.
AWS Glue provides a unified metadata repository across various data sources and formats, such as RDS, Redshift, and Athena. We can integrate AWS Glue with Presto to serve as its metastore: Glue can automatically infer the schema of source data in Amazon S3 and store the associated metadata in its Data Catalog.
Presto is blurring the boundary between analytics on relational and non-relational data sources by supporting both in the same manner, and it is therefore making its mark in the market very quickly. AWS's adoption of Presto has made it even more viable for companies moving to cloud infrastructure.