PrestoDB is a fast, distributed SQL engine for analytic queries over big data of any size. Presto was developed by Facebook in 2012 to run interactive queries against their Hadoop/HDFS clusters, and the project was later open-sourced under the Apache license. Before Presto, Facebook had also created the Hive query engine, but Hive was built for batch processing and was not optimized for fast, interactive queries.
The Presto query engine can run on top of many relational and non-relational sources, such as HDFS, Cassandra, MongoDB, and many more.
Many well-known companies currently use PrestoDB in production to analyze their big data, e.g. Facebook, Airbnb, Netflix, Nasdaq, Atlassian, and many more. Facebook runs over 30,000 queries, processing on the order of petabytes of data daily, and Netflix runs around 3,500 queries per day.
Although Facebook introduced Presto after Hive, Presto is not a replacement for Hive, because the two engines serve different use cases.
| Presto | Hive |
| --- | --- |
| Designed for short, interactive queries. | Designed for batch processing. |
| Typically 10-30x faster. | Lower performance. |
| In-memory architecture; keeps data in memory, and no MapReduce jobs are run. | Runs MapReduce jobs in the background. |
| Less suitable for very large workloads because of in-memory processing. | Suitable for large workloads and long-running transformations. |
Presto is a distributed, parallel processing framework with a single coordinator and multiple workers. A Presto client submits a SQL query to the coordinator, which parses and analyzes the query and then schedules it across multiple workers. Presto supports many connectors, such as HBase, Hive, MongoDB, Cassandra, and more, to fetch the metadata needed for planning queries, and you can create your own custom connector as well. The worker nodes retrieve the actual data through the connectors, execute the query, and finally deliver the result to the client.
Presto supports complex queries, joins, window functions, and aggregations. Presto queries data in place, without first copying it out of its source system, and executes queries in memory, which contributes to faster execution by avoiding unnecessary I/O.
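As a sketch of the kinds of analytics Presto supports, the queries below run a window function and an aggregation over a hypothetical `orders` table (the table name and columns are assumptions for illustration, not from this article):

```sql
-- Hypothetical table: orders(region, order_id, amount)

-- Window function: rank orders by amount within each region.
SELECT
  region,
  order_id,
  amount,
  rank() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
FROM orders;

-- Aggregation: order count and average amount per region.
SELECT region, count(*) AS order_count, avg(amount) AS avg_amount
FROM orders
GROUP BY region;
```

Both statements use standard Presto SQL, so they would work unchanged in Amazon Athena's query editor.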
AWS is an ideal choice for setting up Presto clusters because of its high availability, scalability, reliability, and cost-effectiveness, and you can launch a Presto cluster on the Amazon cloud in minutes. Amazon EMR and Amazon Athena are the easiest ways to deploy Presto on AWS. Amazon Athena lets you use Presto without any node provisioning, cluster tuning, or configuration, because it runs Presto on AWS's serverless platform. With Amazon Athena, you simply point to your data in Amazon S3, define its schema, and start running analytics on top of it. Moreover, you pay only for the queries you run, billed by the amount of data each query scans.
Log in to your AWS account using your AWS credentials and, from the Services tab, select Athena under the Analytics section; this takes you to the Athena console.
The Athena console lets you select your data in Amazon S3 for analytics, and it supports reading data in many formats, such as CSV, TSV, JSON, Parquet, and ORC. You can then create a schema over that data and finally query it using SQL or Hive queries to find insights. Select the Get Started button to move further.
The Amazon Athena editor can run interactive queries (SQL / Hive DDL) on data stored in Amazon S3 without the need for clusters or data warehouses, but before running any query we need to set up an Amazon S3 location to store the results and metadata of each query. So click on the "set up a query result location in Amazon S3" link.
After you click the link, a settings window appears where you provide the S3 bucket location in which you want to store your query results; you can also choose to encrypt them. I have created a bucket named "athena-query-res" (S3 bucket names must be lowercase) and a folder named "athena" under that bucket to save my query results. Click Save to continue.
I have uploaded employee.txt, which contains the data shown above (name, designation, and age), to my S3 bucket for analysis. We will create a table on top of this data using the Athena editor.
The next step is to create a database and a table to represent my data. In the above snapshot, I created a database called "sampledb" and then a table called "employee" to represent the data I uploaded to S3 in the earlier step. I used Hive DDL to create an external table, pointed it at the S3 location, and finally selected all the data to view it in the editor.
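The DDL for this step can be sketched as follows. The S3 path and the comma delimiter are assumptions (adjust them to your own bucket and file format); the database, table, and column names follow the example above:

```sql
-- Create the database (runs as a Hive DDL statement in Athena).
CREATE DATABASE IF NOT EXISTS sampledb;

-- External table over the raw file in S3; Athena/Presto reads the
-- data in place, so dropping the table does not delete the file.
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.employee (
  name        STRING,
  designation STRING,
  age         INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','          -- assumes comma-separated values
LOCATION 's3://my-data-bucket/employee/';  -- hypothetical bucket/prefix

-- Sanity check: view all rows in the editor.
SELECT * FROM sampledb.employee;
```

Note that `LOCATION` must point to the folder (prefix) containing the file, not to the file itself.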
Since my data and schema are in place, I can now run any query on top of that data using either SQL or HQL. In the above example, I ran a query to find all employees whose designation is manager. Similarly, you can perform aggregations, joins, and window operations on top of this data.
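The filter from this step, plus a simple aggregation, can be sketched like this (the literal 'manager' assumes lowercase values in the file; match the case of your actual data):

```sql
-- All employees whose designation is manager.
SELECT name, age
FROM sampledb.employee
WHERE designation = 'manager';

-- Example aggregation on the same table: headcount and
-- average age per designation.
SELECT designation, count(*) AS headcount, avg(age) AS avg_age
FROM sampledb.employee
GROUP BY designation;
```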
You can also view previously run queries and their status using the History tab, where you can see error details, query runtime, query status (succeeded or failed), and submission time, and download query results. Although we have queried a very small data set here, Presto can also be used on petabytes of data.
AWS Glue provides a unified metadata repository across various data sources and formats, such as RDS, Redshift, and Athena. We can integrate AWS Glue with Presto to serve as its metastore. AWS Glue can automatically infer a schema from source data in Amazon S3 and store the associated metadata in its Data Catalog.

Conclusion
Presto is blurring the boundary between analytics on relational and non-relational data sources by supporting both in the same manner, and it is making its mark in the market very quickly. The adoption of Presto by AWS has made it even more viable for companies moving to cloud infrastructure.