Embulk: An open-source plugin-based parallel bulk data loader that makes painful data integration work relaxed. Sharing our knowledge on RubyGems to manage arbitrary files. Sadayuki Furuhashi, Founder & Software Architect, Treasure Data, Inc.
A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - Efficient object serializer > Fluentd - A unified data collection tool > Prestogres - PostgreSQL protocol gateway for Presto > Embulk - A plugin-based parallel bulk data loader > ServerEngine - A Ruby framework to build multiprocess servers > LS4 - A distributed object storage with cross-region replication > kumofs - A distributed, strongly consistent key-value data store
Today’s talk > What’s Embulk? > How does Embulk work? > The architecture > Writing Embulk plugins > Roadmap & Development > Q&A + Discussion
What’s Embulk? > An open-source parallel bulk data loader > using plugins > to make data integration relaxed.
What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” > using plugins > for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.) > to make data integration relaxed > which was very painful: broken records, transactions (idempotency), performance, …
The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 1. First attempt → fails > 2. Write a script to clean the records • Convert “20150127T190500Z” → “2015-01-27 19:05:00 UTC” • Convert “\N” → “” • many more cleanups… > 3. Second attempt → another error • Convert “Inf” → “Infinity” > 4. Fix the script, retry, retry, retry… > 5. Oh, some data got loaded twice!?
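The three conversions on this slide can be sketched as small cleaning functions. This is a minimal illustration in Python, not code from Embulk or from the talk; the function names are hypothetical, and the timestamp layout is assumed to be exactly the one shown above.

```python
from datetime import datetime, timezone


def clean_timestamp(value: str) -> str:
    """Convert a compact timestamp like 20150127T190500Z to 'YYYY-MM-DD HH:MM:SS UTC'."""
    # Assumption: the input is always UTC and always uses this exact layout.
    parsed = datetime.strptime(value, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%d %H:%M:%S UTC")


def clean_null(value: str) -> str:
    """Convert the literal two-character marker \\N to an empty string."""
    return "" if value == r"\N" else value


def clean_float(value: str) -> str:
    """Convert 'Inf' to 'Infinity'; leave every other value untouched."""
    return "Infinity" if value == "Inf" else value
```

Each rule is tiny on its own; the pain described on the slide comes from discovering them one failed load at a time.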
The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 6. OK, the script worked. > 7. Register it with cron to sync data every day. > 8. One day… it fails with another error • Convert invalid UTF-8 byte sequence to U+FFFD
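The last fix, replacing invalid UTF-8 byte sequences with U+FFFD, might look like this in Python. It is only a sketch of the idea using the standard `errors="replace"` decoding mode; `decode_lenient` is a hypothetical helper name.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for any invalid sequence."""
    return raw.decode("utf-8", errors="replace")


# A lone 0xFF byte is never valid in UTF-8, so it becomes the replacement character.
text = decode_lenient(b"abc\xffdef")
```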
The pains of bulk data loading Example: load 10GB CSV × 720 files > Most scripts are slow. • People have little time to optimize bulk load scripts > One file takes 1 hour → 720 files take 1 month (!?) A lot of integration effort for each format and storage system: > XML, JSON, Apache log format (+some custom), … > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile… > MongoDB, Elasticsearch, Redshift, Salesforce, …
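The "1 month" figure is simple arithmetic: 720 files × 1 hour = 720 hours = 30 days when files are loaded one after another. The sketch below shows that calculation and how the estimate changes with parallel workers. It assumes perfectly independent files and no shared bottleneck, which real loads rarely achieve; `total_hours` is a hypothetical helper.

```python
import math


def total_hours(files: int, hours_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock hours when files are spread evenly over workers."""
    # Each worker handles at most ceil(files / workers) files, one after another.
    return math.ceil(files / workers) * hours_per_file


sequential_days = total_hours(720, 1.0) / 24            # 720 hours, i.e. 30 days
parallel_hours = total_hours(720, 1.0, workers=24)      # 30 files per worker
```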
The problems: > Data cleaning (normalization) > How to normalize broken records? > Error handling > How to remove broken records? > Idempotent retrying > How to retry without loading duplicates? > Performance optimization > How to optimize the code or parallelize?
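One common way to make retries idempotent is to record which load tasks have already been committed and to skip them on retry. The sketch below illustrates that idea with an in-memory stand-in for a destination table. It is an assumption-laden illustration, not a description of how Embulk implements retrying; `Destination` and `task_id` are hypothetical names.

```python
class Destination:
    """In-memory stand-in for a target table (hypothetical, for illustration)."""

    def __init__(self):
        self.rows = []
        self.committed_tasks = set()

    def load(self, task_id, records):
        """Append records once per task_id; a repeated call is a no-op."""
        if task_id in self.committed_tasks:
            return False  # already loaded, so the retry is safely skipped
        # In a real database these two steps would share one transaction.
        self.rows.extend(records)
        self.committed_tasks.add(task_id)
        return True


dest = Destination()
dest.load("file-001", [{"id": 1}, {"id": 2}])
dest.load("file-001", [{"id": 1}, {"id": 2}])  # retry after an uncertain failure
```

After the retry, the destination still holds each record only once.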
The problems at Treasure Data Treasure Data Service? > “Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required.” > Customers want to try Treasure Data, but > SEs write scripts to bulk load their data. Hard work :( > Customers want to migrate their big data, but > Hard work :( > Fluentd solved streaming data collection, but > bulk data loading is another problem.
A solution: > Package the efforts as a plugin. > data cleaning, error handling, retrying > Share & reuse the plugin. > don’t repeat the pains! > Keep improving the plugin code. > rather than throwing away the efforts every time > using OSS-style pull-reqs & frequent releases.
Embulk is an open-source, plugin-based parallel bulk data loader that makes data integration work relaxed.
[Diagram: InputPlugin (read records) → OutputPlugin (write records), coordinated by Embulk and its executor plugin]
[Diagram: InputPlugin → record → record → OutputPlugin, coordinated by Embulk and its executor plugin. Example systems: MySQL, Cassandra, HBase, Elasticsearch, Treasure Data, …]
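The input/output split in this diagram can be sketched as two small interfaces joined by a runner. The Python below only illustrates the roles; `RecordInput`, `RecordOutput`, `ListInput`, `ListOutput`, and `run` are hypothetical names and are not Embulk's actual plugin API.

```python
from typing import Dict, Iterable, Iterator, List, Protocol

Record = Dict[str, object]


class RecordInput(Protocol):
    """Role of an input plugin: produce records from some source."""
    def read(self) -> Iterator[Record]: ...


class RecordOutput(Protocol):
    """Role of an output plugin: consume records into some destination."""
    def write(self, records: Iterable[Record]) -> None: ...


class ListInput:
    """Toy input that reads records from an in-memory list."""
    def __init__(self, records: List[Record]):
        self._records = list(records)

    def read(self) -> Iterator[Record]:
        yield from self._records


class ListOutput:
    """Toy output that collects records into an in-memory list."""
    def __init__(self):
        self.written: List[Record] = []

    def write(self, records: Iterable[Record]) -> None:
        self.written.extend(records)


def run(source: RecordInput, sink: RecordOutput) -> None:
    """Minimal single-threaded 'executor': stream records from input to output."""
    sink.write(source.read())


sink = ListOutput()
run(ListInput([{"id": 1}, {"id": 2}]), sink)
```

Because the runner only depends on the two roles, any source can be paired with any destination, which is the point of the plugin split.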
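[Diagram: FileInputPlugin (read files) → DecoderPlugin (decompress) → ParserPlugin (parse files into records) → FormatterPlugin (format records into files) → EncoderPlugin (compress) → FileOutputPlugin (write files). The file input, decoder, and parser stages together act as an InputPlugin; the formatter, encoder, and file output stages together act as an OutputPlugin. Coordinated by Embulk and its executor plugin.]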
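The file-oriented stages compose like a pipeline: decompress, parse into records, then format and compress again on the way out. The sketch below imitates those four stages with gzip and CSV from the Python standard library. The function names mirror the roles in the diagram but are hypothetical, and gzip and CSV are just example formats chosen for the illustration.

```python
import csv
import gzip
import io
from typing import Dict, List


def decode(raw: bytes) -> str:
    """Decoder role: decompress gzip bytes into text."""
    return gzip.decompress(raw).decode("utf-8")


def parse(text: str) -> List[Dict[str, str]]:
    """Parser role: turn CSV text (with a header row) into records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def format_records(records: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Formatter role: turn records back into CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def encode(text: str) -> bytes:
    """Encoder role: compress text into gzip bytes."""
    return gzip.compress(text.encode("utf-8"))


# Input side: file bytes -> decoder -> parser -> records
source_file = gzip.compress(b"id,name\n1,a\n2,b\n")
records = parse(decode(source_file))

# Output side: records -> formatter -> encoder -> file bytes
output_file = encode(format_records(records, ["id", "name"]))
```

Swapping gzip for another codec, or CSV for another format, would touch only one stage each, which is what the separate plugin types allow.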
Roadmap > Add missing JRuby Plugin APIs > ParserPlugin, FormatterPlugin > DecoderPlugin, EncoderPlugin > Add Executor plugin SPI > Add an SSH-based distributed executor > embulk run --command ssh %host embulk run %task > Add MapReduce executor > Add support for nested records (?)
Contributing to the Embulk project > Pull-requests & issues on GitHub > Posting blogs > “I tried Embulk. Here is how it worked” > “I read Embulk code. Here is how it’s written” > “Embulk is good because…but bad because…” > Talking on Twitter using the word “embulk” > Writing & releasing plugins > Windows support > Integration with other software > ETL tools, Fluentd, Hadoop, Presto, …