Fluentd / Embulk For reliable transfer Masahiro Nakagawa Apr 18, 2015 Game Server meetup #4
Who are you? > Masahiro Nakagawa > github/twitter: @repeatedly > Treasure Data, Inc. > Senior Software Engineer > Fluentd / td-agent developer > Living at OSS :) > D language - Phobos committer > Fluentd - Main maintainer > MessagePack / RPC - D and Python (only RPC) > The organizer of several meetups (Presto, DTM, etc…) > etc…
Why JSON / MessagePack? (1) > Schema on Write (traditional MPP DB) > Write data using a schema to improve query performance > Pros > minimum query overhead > Cons > Need to design schema and workload beforehand > Data load is an expensive operation
Why JSON / MessagePack? (2) > Schema on Read (Hadoop) > Write data without a schema and map the schema at query time > Pros > Robust against schema and workload changes > Data load is a cheap operation > Cons > High overhead at query time
Core Plugins > Divide & Conquer > Use case specific (handled by plugins) > Read / receive data > Parse data > Filter data > Buffer data > Format data > Write / send data > Common concerns (handled by Fluentd core) > Buffering & Retrying > Error handling > Message routing > Parallelism
Event structure (log message) ✓ Tag > for message routing > where the event is from ✓ Time > default second unit > from data source ✓ Record > JSON format > MessagePack internally > schema-free
Configuration and operation > No central / master node > @include helps configuration sharing > Operation depends on your environment > Use your daemon / deploy tools > We use Chef at Treasure Data > Apache-like syntax
Treasure Agent (td-agent) > Treasure Data distribution of Fluentd > Includes Ruby, popular plugins, etc. > Treasure Agent 2 is the current stable release > We recommend v2, not v1 > rpm, deb and dmg packages > Latest version is 2.2.0 with Fluentd v0.12
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag backend.apache
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to MongoDB
<match backend.*>
  type mongo
  database fluent
  collection test
</match>
Less Simple Forwarding - At-most-once / At-least-once - HA (failover) - Load-balancing
Near realtime and batch combo! (diagram: hot data flows to the realtime path, all data to the batch path)
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match web.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  <store>
    type webhdfs
    host namenode
    port 50070
    path /path/on/hdfs/
  </store>
</match>
Treasure Data (diagram): Frontend, Worker, Job Queue, Hadoop and Presto applications push metrics to Fluentd (via local Fluentd) > Fluentd sums up the data per minute (partial aggregation) > aggregated data goes to Treasure Data for historical analysis and to Datadog for realtime monitoring
Cookpad (diagram): hundreds of Rails app servers, each running td-agent, send event logs to Treasure Data > logs are available after several minutes > daily/hourly batch jobs feed MySQL and Google Spreadsheet for KPI visualization, rankings and feedback ✓ Over 100 RoR servers (2012/2/4) ✓ Unlimited scalability ✓ Flexible schema ✓ Realtime ✓ Less performance impact
fluent-bit > Made for Embedded Linux > OpenEmbedded & Yocto Project > Intel Edison, RasPi & BeagleBone Black boards > https://github.com/fluent/fluent-bit > Standalone application or library mode > Built-in plugins > input: cpu, kmsg; output: fluentd > First release at the end of March 2015
fluentd-forwarder > Forwarding agent written in Go > Focused on log forwarding to Fluentd > Works on Windows > Bundles TCP input/output and TD output > No flexible plugin mechanism > We plan to add more inputs/outputs > Similar products > fluent-agent-lite, fluent-agent-hydra, ik
The problems at Treasure Data > Treasure Data Service on the Cloud > Customers want to try Treasure Data, but > SEs write scripts to bulk load their data. Hard work :( > Customers want to migrate their big data, but > Hard work :( > Fluentd solved streaming data collection, but > bulk data loading is another problem.
Embulk > Bulk Loader version of Fluentd > Pluggable architecture > Plugins in JRuby / JVM languages > High performance parallel processing > Share your script as a plugin > https://github.com/embulk
The problems of bulk load > Data cleaning (normalization) > How to normalize broken records? > Error handling > How to remove broken records? > Idempotent retrying > How to retry without duplicated loading? > Performance optimization