Self-introduction • @toyama0919 • Analytics infrastructure. • Recently working mostly with Embulk… • Have been using Presto for one and a half years.
Our development situation • We commonly use SQL. • Marketing staff don't write SQL. • I often write complicated SQL, around 100 lines long. • We love OSS. • We don't use UPDATE, INSERT, or DELETE through Presto.
Our business situation • We manage and operate BtoB web sites. • Our data lifecycle is long. • The business side doesn't write SQL. • They watch re:dash and Adobe Analytics. • Sales have increased for 15 straight years.
Analysts want data quickly
[Diagram: pipeline stages Collect, Batch, Data Store, Visualize; run by batch (Digdag)]
Analytics priority 1. Direct SQL 2. Presto 3. ETL
Cost differs greatly between 1 and 3
Why use Presto? • Cross-server join • Window functions • UDFs
Cross Server Join
Join • Cross-server and cross-database. • A single Presto query can combine data from multiple sources. • We use queries that join multiple sources. • Reduces ETL pain.
Collect data in one place? • It is equivalent if we can get the data with one query. • I don't want duplicate data (master data, user data). • Collecting the data in one place has a high development cost.
with mysql_user as (
  select user_id, user_name
  from mysql.schema.users
),
redshift_user_log as (
  select user_id, log_time
  from redshift.schema.pageview
)
-- the two source queries run in parallel
select mysql_user.user_id, mysql_user.user_name, count(*)
from mysql_user
inner join redshift_user_log
  on mysql_user.user_id = redshift_user_log.user_id
group by mysql_user.user_id, mysql_user.user_name
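As a rough illustration of what the cross-source query above computes, here is a minimal plain-Python sketch. It does not talk to Presto; the sample rows and the function name `pageviews_per_user` are hypothetical, and it assumes `user_id` is unique in the users table.

```python
from collections import Counter

# Hypothetical sample rows standing in for mysql.schema.users
mysql_users = [
    {"user_id": 1, "user_name": "alice"},
    {"user_id": 2, "user_name": "bob"},
]

# Hypothetical sample rows standing in for redshift.schema.pageview
redshift_pageviews = [
    {"user_id": 1, "log_time": "2016-01-01 10:00:00"},
    {"user_id": 1, "log_time": "2016-01-01 10:05:00"},
    {"user_id": 2, "log_time": "2016-01-01 11:00:00"},
    {"user_id": 3, "log_time": "2016-01-01 12:00:00"},  # no matching user
]


def pageviews_per_user(users, pageviews):
    """Inner join on user_id, then count rows per (user_id, user_name)."""
    # Build a lookup from the smaller side (assumes user_id is unique).
    names = {u["user_id"]: u["user_name"] for u in users}
    counts = Counter()
    for pv in pageviews:
        uid = pv["user_id"]
        if uid in names:  # inner join: skip page views without a user
            counts[(uid, names[uid])] += 1
    # Sort only to make the output order deterministic.
    return [(uid, name, n) for (uid, name), n in sorted(counts.items())]


result = pageviews_per_user(mysql_users, redshift_pageviews)
```

With Presto, the same join and aggregation happen inside one query, so neither table has to be copied into the other system first.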
Mechanisms MySQL does not support • Window functions • WITH queries – RECURSIVE is not supported. • URL functions • Array data type • CROSS JOIN UNNEST
Array type
select split(keywords, ',') as keywords
from mysql_keywords_table

Input:
keywords
----------------------------
keyword1,keyword2,keyword3

Result:
keywords
----------------------------
['keyword1','keyword2','keyword3']
Horizontal to vertical
SELECT keyword
FROM mysql_keywords_table
CROSS JOIN UNNEST(split(keywords, ',')) AS t (keyword)
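The effect of `split` followed by `CROSS JOIN UNNEST` can be sketched in plain Python. This is only an emulation of the semantics, not Presto code; the function names and sample row are hypothetical.

```python
def split_keywords(keywords):
    """Emulates split(keywords, ','): one string becomes a list."""
    return keywords.split(",")


def cross_join_unnest(rows):
    """Emulates CROSS JOIN UNNEST: one output row per array element."""
    out = []
    for row in rows:
        for keyword in split_keywords(row["keywords"]):
            out.append({"keyword": keyword})
    return out


# Hypothetical row standing in for mysql_keywords_table
table = [{"keywords": "keyword1,keyword2,keyword3"}]

as_array = split_keywords(table[0]["keywords"])
as_rows = cross_join_unnest(table)
```

The first call corresponds to the array-type slide (one row holding a list), the second to the horizontal-to-vertical slide (one row per keyword).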
Prestogres • PostgreSQL protocol gateway for Presto. • Rewrites Presto queries before sending them to PostgreSQL. • Has password-based authentication and SSL.
Why Prestogres? • Connectivity with other applications. – pgAdmin, the psql command. – re:dash connects to Presto over the PostgreSQL protocol. – But it can also connect to Presto directly. • Connecting to Presto directly requires a Presto client. – I don't want to use the Java client. • Presto's security is weak. – Prestogres handles authentication.
Prestogres limitations • Prepared statements. – Presto does not support them either. – So embulk-input-postgresql does not work. • Can’t fetch the schema with SQL. • Temporary tables • DROP TABLE
re:dash • Visualization platform, written in Python. • Supports many data sources. • Share queries with team members. • Schedule queries (per day, per hour). • Very active contributions.
Presto queries increased rapidly with re:dash • The number of Presto queries increased more than 10 times. • That won't change even if we write ETL on re:dash. • re:dash has a good reputation internally.
Okay, all analytics problems solved!
No... We can’t escape from ETL
Embulk with Presto • We use our own embulk-input-presto plugin. – Supports the JSON type. • Create point-in-time data. • Create machine learning data.
Why Embulk? • Very active plugin ecosystem. • Complicated string analysis can't be done with SQL alone. • The combination with Digdag is very powerful. • We want to get things done by the shortest route. • Fluentd is overworked..
Install by RPM • Presto has an RPM. – It is not distributed. – It needs a source build.. • Includes an init script. • But OpenJDK is not supported.. – Pull request in progress..
AWS integration • We build Presto on EC2. • We don't use EMR. • Workers are spot instances of multiple instance types. – Prevents them all going down at once.
Networking • Place the Presto cluster (coordinator and workers) in the same AZ. • Across AZs, the traffic cost (and money) is very high. – Should not be multi-AZ.
Networking on AWS [Diagram: two Availability Zones; the coordinator and workers form a cluster, and a worker in the other Availability Zone is marked "Not"]
Problems • Very huge repository. • The coordinator is a SPOF. • Running long-range queries causes OutOfMemoryError.
Very huge repository • Monolithic application. – I want separate repositories. • The first build takes 30 minutes. • Builds after the first take 10 minutes. • All connectors are in the main repository. – MongoDB, Kafka, Cassandra.. – Elasticsearch support is coming soon. • Hard to contribute.
Big change for JDBC • Supports predicate pushdown for multiple data types. • We used to apply a patch to Presto… • MySQL people, let's try it.
Impressions of Presto I have heard • "It is an extension of Hadoop technology." => I don't know Hadoop. Presto has many connectors. • "Parallel processing looks difficult." => Presto has no storage, so that matters little. • "I don't have such big data." => I'm not such a big player either.
Summary • Presto is great software. • It is not that difficult. • Let's use it more.