What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.)
> to make data integration easy,
> which was very painful… (broken records, transactions (idempotency), performance, …)
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many more cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
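The cleaning steps listed on this slide can be sketched in a few lines of Python. This is only an illustration of the kind of ad-hoc script the slide describes, not part of Embulk; the function names are hypothetical, and the timestamp cleaner assumes every input value uses the exact “…T…Z” layout shown above.

```python
from datetime import datetime


def clean_timestamp(value: str) -> str:
    """Rewrite '2015-01-27T19:05:00Z' as '2015-01-27 19:05:00 UTC'."""
    # Assumption: the input always uses this exact ISO 8601 layout with a literal 'Z'.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m-%d %H:%M:%S") + " UTC"


def clean_null(value: str) -> str:
    """Rewrite the '\\N' null marker as an empty string."""
    return "" if value == r"\N" else value


def clean_infinity(value: str) -> str:
    """Rewrite 'Inf' as 'Infinity'."""
    return "Infinity" if value == "Inf" else value
```

Every new kind of broken record means another function like these, which is the pain the slide is pointing at.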
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. OK, the script worked.
> 7. Register it in cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequences to U+FFFD
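The last fix on this slide, replacing invalid UTF-8 with U+FFFD, might look like this in Python. It is a minimal sketch; `decode_lossy` is a hypothetical helper name, and the behavior relies on Python's standard `errors="replace"` decoding mode.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid byte sequences."""
    return raw.decode("utf-8", errors="replace")
```

For example, `decode_lossy(b"abc\xffdef")` yields `"abc\ufffddef"`, because the byte 0xFF can never appear in valid UTF-8.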
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of integration effort for each storage:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
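The 720-file estimate above is simple arithmetic: 720 files × 1 hour = 720 hours = 30 days when files are loaded one at a time. The sketch below reproduces that number and shows how it would shrink with parallel workers. It assumes the files are independent and that parallelism scales perfectly, which real loads rarely achieve; `total_days` is a hypothetical helper.

```python
import math


def total_days(files: int, hours_per_file: float, workers: int = 1) -> float:
    """Estimate wall-clock days needed to load all files.

    Assumption: files are independent and each worker loads one file at a time.
    """
    rounds = math.ceil(files / workers)  # sequential rounds of parallel loads
    return rounds * hours_per_file / 24
```

With one worker, `total_days(720, 1)` gives 30.0 days; with 8 workers, `total_days(720, 1, workers=8)` gives 3.75 days.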
The problems:
> Data cleaning (normalization)
  • How to normalize broken records?
> Error handling
  • How to remove broken records?
> Idempotent retrying
  • How to retry without duplicated loading?
> Performance optimization
  • How to optimize the code or parallelize?
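One common way to approach the “retry without duplicated loading” question is to stage each task's rows under a task key and publish everything in a single commit step. The sketch below illustrates that idea only; it is not Embulk's implementation, and `StagedLoader` and its methods are hypothetical names.

```python
class StagedLoader:
    """Toy loader: retries overwrite staged rows instead of appending to them."""

    def __init__(self) -> None:
        self.staged = {}       # task_id -> rows produced by that task
        self.committed = []    # rows visible in the destination
        self.is_committed = False

    def load_task(self, task_id: int, rows: list) -> None:
        # Re-running a task replaces its earlier output, so a retry cannot duplicate rows.
        self.staged[task_id] = list(rows)

    def commit(self) -> None:
        # Publishing happens once; calling commit again is a no-op.
        if self.is_committed:
            return
        for task_id in sorted(self.staged):
            self.committed.extend(self.staged[task_id])
        self.is_committed = True
```

Because both `load_task` and `commit` are safe to repeat, a script built this way can be retried after any failure without loading data twice.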
Type conversion
Input plugin (parser plugin if input is file-based) → Embulk type system → Output plugin (formatter plugin if output is file-based)
> Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
> Embulk type system: boolean, long, double, string, timestamp
> Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
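The idea of funnelling many input types into Embulk's five types can be sketched as a lookup table. The specific pairings below are an assumption inferred from the slide's layout (for example, that `date` maps to `timestamp`); they are not taken from any plugin's source, and `to_embulk_type` is a hypothetical helper.

```python
# The five types listed on the slide for the Embulk type system.
EMBULK_TYPES = {"boolean", "long", "double", "string", "timestamp"}

# Assumed mapping from the slide's PostgreSQL-style input types to Embulk types.
INPUT_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_type(input_type: str) -> str:
    """Return the Embulk type for an input type, or raise for unmapped types."""
    try:
        return INPUT_TO_EMBULK[input_type]
    except KeyError:
        raise ValueError(f"no mapping for input type: {input_type}") from None
```

An output plugin would apply a second mapping of the same shape, from the five Embulk types to the destination's types.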
What’s added since the first release?
• v0.3
  • Resuming
  • Filter plugin type
• v0.4
  • Plugin template generator
  • Incremental execution (ConfigDiff)
  • Isolated ClassLoaders for Java plugins
  • Polyglot command launcher
Resuming
• Retries a failed transaction without retrying everything.
• Skips successful tasks by using information stored in a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
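The resuming behavior described above can be illustrated with a small sketch: record which tasks finished in a state file, and skip them on the next run. This is a simplified model, not Embulk's code; the JSON state format and the `run_with_resume` function are assumptions made for the example.

```python
import json
import os


def run_with_resume(tasks, state_path):
    """Run callables in order, skipping those recorded as done in state_path."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])
    try:
        for index, task in enumerate(tasks):
            if index in done:
                continue  # succeeded in a previous run
            task()
            done.add(index)
    finally:
        # Save progress even when a task raises, so the next run can resume.
        with open(state_path, "w") as f:
            json.dump({"done": sorted(done)}, f)
    return sorted(done)
```

If the third of five tasks fails, a second call with the same state file reruns only tasks three to five.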
Filter plugin type
• Filters rows out, filters columns out, or enriches the data.
• 18 plugins released.
Plugin template generator
• Generates a template for a plugin.
• The generated code is ready to compile.
  > You modify & compile it to do your work.
• embulk new <category> <name>
Incremental execution
• Stores the last file name or row in a file, and the next execution starts from there.
• Use case: sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
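The incremental pattern above can be sketched as follows: remember the last file name that was loaded, and on the next run load only names that sort after it. This is an illustration under stated assumptions, not Embulk's implementation; it assumes new files always sort lexicographically after old ones, and the `last_path` key and function names are hypothetical.

```python
import json
import os


def select_new_files(all_files, last_path):
    """Return files that sort after last_path (all files if last_path is None)."""
    return sorted(name for name in all_files if last_path is None or name > last_path)


def run_incremental(all_files, state_path, load):
    """Load only files not seen by the previous run, then record the last one."""
    last_path = None
    if os.path.exists(state_path):
        with open(state_path) as f:
            last_path = json.load(f).get("last_path")
    new_files = select_new_files(all_files, last_path)
    for name in new_files:
        load(name)
    if new_files:
        # Persist the position for the next execution.
        with open(state_path, "w") as f:
            json.dump({"last_path": new_files[-1]}, f)
    return new_files
```

Running this daily against a growing file listing loads each file once, which matches the S3-to-Elasticsearch use case on the slide.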
Isolated ClassLoaders for Java plugins
• Embulk can load multiple versions of Java plugins.
Plugin Version Conflicts
[Diagram: in a single Java runtime on top of Embulk Core, embulk-input-s3.jar bundles aws-sdk.jar v1.10 while embulk-output-redshift.jar bundles aws-sdk.jar v1.9. Version conflicts!]
Multiple Classloaders in JVM
[Diagram: in a single Java runtime on top of Embulk Core, Class Loader 1 holds embulk-input-s3.jar with aws-sdk.jar v1.10 and Class Loader 2 holds embulk-output-redshift.jar with aws-sdk.jar v1.9. Isolated environments.]
Polyglot launcher script
• embulk.jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc
Executor plugin type
• embulk-executor-mapreduce executes tasks in a distributed environment.
Liquid template engine
• A config file can include variables.
EmbulkEmbed & Embulk::Runner
• Embeds Embulk in an application.
Plugin bundle
• Uses fixed versions of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml
Gradle v2.6
• Continuous compilation.
• “embulk migrate .” upgrades the Gradle version of your plugin project.
• ./gradlew -t build
Future plan
• v0.8
  • JSON type (issue #306)
  • Error plugin type (#27, #124)
  • More (or less) concurrency for output (#231)
• v0.9
  • More Guess (#242, #235)
  • Multiple jobs using a single config file (#167)