Understanding Presto Presto meetup @ Tokyo #1 Sadayuki Furuhashi Founder & Software Architect Treasure Data, inc.
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A bulk data loader with plugin-based architecture
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed strongly-consistent key-value data store
[Architecture diagram] Client → Coordinator → Workers, with Workers registered via the Discovery Service. Each Worker reads data through a connector plugin: the Hive Connector talks to HDFS / Metastore, the JDBC Connector to PostgreSQL, and other connectors to other data sources...
JOIN
[Diagram: Presto joins the orders table in HDFS / Metastore with the users table in MySQL; PostgreSQL shown as another source]

select orderkey, orderdate, custkey, email
from orders
join mysql.presto_test.users on orders.custkey = users.id
order by custkey, orderdate;
INSERT INTO / CREATE TABLE AS
[Diagram: Presto joins data from HDFS / Metastore with MySQL and writes the result back into MySQL]

create table mysql.presto_test.recent_user_info as
select users.id, users.email, count(1) as count
from orders
join mysql.presto_test.users on orders.custkey = users.id
group by 1, 2;
1. Distributed & plug-in architecture
> 3 types of servers:
> Coordinator, Worker, Discovery server
> Gets data/metadata through connector plugins.
> Presto is stateless (Presto is NOT a database).
> Presto can provide distributed SQL to any data store.
• connectors are loosely coupled (which may add some overhead)
> Client protocol is HTTP + JSON
> Language bindings: Ruby, Python, PHP, Java, R, etc.
> ODBC & JDBC support via Prestogres
> https://github.com/treasure-data/prestogres
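The HTTP + JSON client protocol mentioned above can be sketched in a few lines of Python: the client POSTs the SQL text to the coordinator's `/v1/statement` endpoint and then follows `nextUri` links, collecting the `data` arrays from each result page. This is a minimal, hedged sketch (no error handling, no session properties); the coordinator URL and the catalog/schema values are placeholders you would substitute for your cluster.

```python
import json
import urllib.request

def collect_rows(pages):
    """Merge the 'data' arrays from a sequence of Presto result pages.
    Pages without data (e.g. ones that only carry column metadata) are skipped."""
    rows = []
    for page in pages:
        rows.extend(page.get("data", []))
    return rows

def run_query(coordinator, sql, user="presto", catalog="hive", schema="default"):
    """POST the query, then follow nextUri until the result set is complete.
    `coordinator` is e.g. "http://localhost:8080" (a placeholder here)."""
    req = urllib.request.Request(
        coordinator + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={
            "X-Presto-User": user,
            "X-Presto-Catalog": catalog,
            "X-Presto-Schema": schema,
        },
    )
    pages = [json.load(urllib.request.urlopen(req))]
    while "nextUri" in pages[-1]:
        pages.append(json.load(urllib.request.urlopen(pages[-1]["nextUri"])))
    return collect_rows(pages)
```

Each language binding listed above is essentially a wrapper around this request/poll loop.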
Other Presto features
> Comprehensive SQL features
> WITH cte AS (SELECT …) SELECT * FROM cte …;
> implicit JOIN (join criteria in WHERE)
> VIEW
> INSERT INTO … VALUES (1,2,3)
> Time & date types & functions compatible with both MySQL & PostgreSQL
> Cluster management using SQL
> SELECT * FROM sys.node;
> sys.task, sys.query
2. Query Planning
Presto's execution model
> Presto is NOT MapReduce
> Presto's query plan is based on a DAG
> more like Spark or traditional MPP databases
MapReduce vs. Presto

MapReduce:
> Waits between map and reduce stages
> Writes intermediate data to disk (disk IO)

Presto:
✓ All stages are pipelined: no wait time
✓ Memory-to-memory data transfer: no disk IO
✓ No fault-tolerance (a failed task fails the query)
✓ Data chunks must fit in memory
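The pipelining contrast can be illustrated with Python generators: each stage pulls rows from the one before it as they are produced, so a downstream stage receives its first row before the upstream stage has finished, and no intermediate result is materialized to disk. This is an illustrative sketch of the execution style, not Presto code; the stage names are made up for the example.

```python
def scan(rows):
    # Upstream stage: emit each row as soon as it is read.
    for row in rows:
        yield row

def project(rows):
    # Downstream stage: consume one row, emit one row, with no buffering
    # in between (memory-to-memory, not write-to-disk as in MapReduce).
    for row in rows:
        yield row * 2

# Chain the stages; nothing runs until rows are pulled from the end.
stream = project(scan(range(5)))
first = next(stream)  # downstream already has a row before the scan finishes
```

In MapReduce the equivalent would be `list(scan(...))` written to disk before `project` even starts; the generator chain is what "pipelined" means on the slide.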
Query Planner

SQL:
SELECT name, count(*) AS c
FROM access
GROUP BY name

Table schema:
TABLE access (
  name varchar,
  time bigint
)

Logical query plan:
Table scan (name:varchar) → GROUP BY + aggregation (name, count(*)) → Output (name, c)

Distributed query plan:
Table scan → Partial aggregation → Sink → Exchange → Final aggregation → Sink → Exchange → Output
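The partial/final aggregation split in the distributed plan can be sketched as plain Python: each task pre-aggregates its own splits of the `access` table, and the final stage merges the partial counts it receives through the exchange. A simplified model, not Presto internals; the split data is invented for the example.

```python
def partial_aggregation(split):
    """Per-task pre-aggregation: count names within one split."""
    counts = {}
    for name in split:
        counts[name] = counts.get(name, 0) + 1
    return counts

def final_aggregation(partials):
    """Merge the partial counts shipped through the exchange."""
    merged = {}
    for partial in partials:
        for name, c in partial.items():
            merged[name] = merged.get(name, 0) + c
    return merged

# Two splits of the access table, aggregated in two phases:
splits = [["alice", "bob", "alice"], ["bob", "carol"]]
result = final_aggregation(partial_aggregation(s) for s in splits)
# result == {"alice": 2, "bob": 2, "carol": 1}
```

Pre-aggregating per split shrinks the data crossing the exchange, which is why the planner inserts the partial stage below it.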
Query Planner - Stages

Stage-2: Table scan → Partial aggregation → Sink
  ↓ inter-worker data transfer (Exchange)
Stage-1: Final aggregation → Sink (pipelined aggregation)
  ↓ inter-worker data transfer (Exchange)
Stage-0: Output
2. Query Planning
> SQL is converted into stages, tasks, and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (the query fails)
> Memory-to-memory data transfer
> No disk IO
> If hash-partitioned aggregation data doesn't fit in memory, the query fails
• Note: the query dies but the worker doesn't. Memory consumption is fully managed.
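The "query dies but the worker doesn't" behavior can be modeled as a hard cap checked during hash aggregation: when the aggregation state grows past the limit, the query is aborted with an exception rather than letting the process run out of memory. This is a toy model under stated assumptions; `max_groups` stands in for Presto's per-query memory limit, and the exception name is invented for the sketch.

```python
class QueryExceededMemoryLimitError(Exception):
    """Raised when a query's aggregation state exceeds its memory budget."""

def hash_aggregate(rows, max_groups):
    """Hash aggregation with a hard cap on state size: the query fails,
    but the worker process survives and can run the next query."""
    counts = {}
    for name in rows:
        counts[name] = counts.get(name, 0) + 1
        if len(counts) > max_groups:
            # Fail this query only; no disk spill in this model.
            raise QueryExceededMemoryLimitError(
                "aggregation state exceeded the query memory limit")
    return counts
```

Because the limit is enforced per query, one oversized aggregation cannot take down the worker or the other queries running on it.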