Btw, I’m Erik Bernhardsson
I’m at Spotify in NYC, focusing mostly on music discovery and large-scale machine learning
Previously managed the Analytics team in Stockholm
Background
Why did we build Luigi?
We crunch a lot of data: billions of log messages (several TBs) every day
Usage and backend stats, debug information
What we want to do:
A/B testing
Music recommendations
Monthly/daily/hourly reporting
Business metric dashboards
We experiment a lot, so we need quick development cycles
We like Hadoop
Our second cluster (in 2009):
Our fifth cluster
Long story short :)
Running one job is easy
But what about running 1000s of jobs every day?
Lots of long-running processes with dependencies
Need monitoring
Handle failures
Go from experimentation to production easily
But also non-Hadoop stuff
Most things are Python Map/Reduce jobs
Also Pig, Hive
SCP files from one host to another
Train a machine learning model
Put data in Cassandra
How not to do workflows
In the pre-Luigi world
Example: Artist Toplist
“Streams” is a list of (username, track, artist, timestamp) tuples
Streams → Artist Aggregation → Top 10 → Database
Don’t do this at home
Pre-Luigi example of artist toplists
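The original slide’s code isn’t reproduced here, but a naive pre-Luigi toplist script might look like the sketch below, in plain Python. The file paths and the tab-separated stream format are assumptions for illustration:

```python
import collections

def aggregate_artists(stream_lines):
    """Count plays per artist from (user, track, artist, timestamp) TSV lines."""
    counts = collections.Counter()
    for line in stream_lines:
        user, track, artist, timestamp = line.rstrip("\n").split("\t")
        counts[artist] += 1
    return counts

def top_n(counts, n=10):
    """Return the n most-played artists, most played first."""
    return [artist for artist, _ in counts.most_common(n)]

def main():
    # The pre-Luigi way: hardcoded (hypothetical) paths, no atomicity,
    # no resumption. Crash halfway through and toplist.tsv is left
    # half-written; rerun and everything is recomputed from scratch.
    with open("/data/streams.tsv") as f:
        counts = aggregate_artists(f)
    with open("/data/toplist.tsv", "w") as f:
        for artist in top_n(counts):
            f.write(artist + "\n")
```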
OK, so chain the tasks
Cron nicer, yay!
Errors will occur
That’s OK, but don’t leave broken data somewhere
(btw, Luigi gives you atomic file operations, locally and in HDFS)
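The atomicity here comes from the classic write-to-temp-then-rename pattern; a stripped-down sketch of the idea in plain Python (not Luigi’s actual implementation):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_write(path):
    """Write to a temp file and rename it into place only on success,
    so readers never see a half-written output file."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    f = os.fdopen(fd, "w")
    try:
        yield f
        f.close()
        os.rename(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        f.close()
        os.remove(tmp_path)        # a failure leaves no partial output behind
        raise
```

If the writing code raises, the destination file is untouched and the temp file is cleaned up, so a failed job never leaves broken data for downstream consumers.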
Don’t run things twice
The second step fails, you fix it, then you want to resume
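The core of resumability is a completeness check before each step: if a step’s output already exists, skip it. A minimal sketch of the pattern in plain Python (Luigi generalizes this via a task’s output() and complete() methods):

```python
import os

def run_step(output_path, produce):
    """Run `produce` only if `output_path` does not already exist."""
    if os.path.exists(output_path):
        # Step already ran successfully once -- don't run it twice
        return False
    produce(output_path)
    return True

# In a chain of steps: after fixing a failure in step 2, rerunning the
# whole chain skips step 1 because its output file is still there.
```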
Parametrize tasks
To use data flows as command line tools
Put tasks in loops
You want to run the dataflow for a set of similar inputs
Plumbing sucks...
Graph algorithms rock!
Who’s the world’s second most famous plumber?
Hint: he wears green
Introducing Luigi
A Python framework for data flow definition and execution
Luigi is “kind of like Makefile” in Python
On steroids and PCP
... with a toolbox of mainly Hadoop-related stuff
Main features:
Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration
Luigi - Aggregate Artists
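The code from this slide isn’t included in the text; a sketch of what AggregateArtists could look like as a Luigi task, assuming a local tab-separated Streams output (the paths are hypothetical, and the luigi package must be installed):

```python
import luigi

class Streams(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/streams.tsv")  # assumed path

class AggregateArtists(luigi.Task):
    def requires(self):
        return Streams()  # declares the dependency graph

    def output(self):
        return luigi.LocalTarget("data/artist_streams.tsv")

    def run(self):
        artist_count = {}
        with self.input().open("r") as in_file:
            for line in in_file:
                user, track, artist, timestamp = line.strip().split("\t")
                artist_count[artist] = artist_count.get(artist, 0) + 1
        # output().open('w') is atomic: the file only appears once closed
        with self.output().open("w") as out_file:
            for artist, count in artist_count.items():
                out_file.write("%s\t%d\n" % (artist, count))
```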
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Completing the top list
Top 10 artists: wraps arbitrary Python code
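The slide’s code isn’t reproduced here; a sketch of a Top10Artists task whose run() is ordinary Python, depending on an AggregateArtists task like the one named on the earlier slide (paths are assumptions):

```python
import luigi

class Top10Artists(luigi.Task):
    def requires(self):
        return AggregateArtists()

    def output(self):
        return luigi.LocalTarget("data/top_artists.tsv")

    def run(self):
        # run() can be arbitrary Python: sort the counts, keep ten
        counts = []
        with self.input().open("r") as in_file:
            for line in in_file:
                artist, count = line.strip().split("\t")
                counts.append((int(count), artist))
        with self.output().open("w") as out_file:
            for count, artist in sorted(counts, reverse=True)[:10]:
                out_file.write("%s\t%d\n" % (artist, count))
```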
Database support
Basic functionality for exporting to Postgres
Cassandra support is in the works
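A sketch of the Postgres export, assuming Luigi’s CopyToTable base class (luigi.postgres in early versions, later moved to luigi.contrib.postgres); the connection details, table name, and column types are assumptions:

```python
import luigi
import luigi.postgres  # luigi.contrib.postgres in later Luigi versions

class ArtistToplistToDatabase(luigi.postgres.CopyToTable):
    # Connection details below are placeholders
    host = "localhost"
    database = "toplists"
    user = "luigi"
    password = "secret"
    table = "artist_toplist"
    columns = [("artist", "TEXT"),
               ("streams", "INT")]

    def requires(self):
        # CopyToTable reads the tab-separated input target and
        # copies its rows into the table
        return Top10Artists()
```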
Running it all...
DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic
Tasks have an implicit __init__
Generates a command line interface with typing and documentation:
$ python dataflow.py AggregateArtists --date 2013-03-05
Task Parameters
Combined usage example
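The combined-usage slide’s code isn’t reproduced; a sketch of how the --date parameter shown above might be declared (luigi.DateParameter is the real mechanism; the paths, and a date-parametrized Streams task, are assumptions):

```python
import luigi

class AggregateArtists(luigi.Task):
    date = luigi.DateParameter()  # becomes --date on the command line

    def requires(self):
        return Streams(self.date)  # assuming Streams takes a date too

    def output(self):
        # the parameter value is typically baked into the output path,
        # so each date gets its own target and its own completeness check
        return luigi.LocalTarget("data/artist_streams_%s.tsv" % self.date)
```

Because tasks are plain Python objects, putting them in loops is trivial: build one AggregateArtists instance per date in a range and hand them all to the scheduler.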
Task templates and targets
... how to run anything, really
Luigi comes with a toolbox of abstract Tasks for:
Running Hadoop MapReduce using Hadoop Streaming or custom jar files
Running Hive and (soon) Pig queries
Inserting data sets into Postgres
Writing new ones is as easy as defining an interface and implementing run()
Hadoop MapReduce
Built-in Hadoop Streaming Python framework
Features:
Tiny interface, just implement mapper and reducer
Fetches error logs from the Hadoop cluster and displays them to the user
Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins
Easy to send along Python modules that might not be installed on the cluster
Support for counters, secondary sort, combiners, distributed cache, etc.
Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
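A sketch of the tiny mapper/reducer interface, assuming Luigi’s Hadoop Streaming base class (luigi.hadoop.JobTask in early versions, later luigi.contrib.hadoop); the StreamsHdfs dependency and the HDFS path are hypothetical:

```python
import luigi
import luigi.hadoop  # luigi.contrib.hadoop in later Luigi versions
import luigi.hdfs    # luigi.contrib.hdfs in later Luigi versions

class AggregateArtistsHadoop(luigi.hadoop.JobTask):
    def requires(self):
        return StreamsHdfs()  # hypothetical task producing streams in HDFS

    def output(self):
        return luigi.hdfs.HdfsTarget("data/artist_streams")

    # That's the whole interface: yield key/value pairs from the mapper,
    # aggregate them in the reducer. Luigi handles job submission,
    # shipping the Python code to the cluster, and fetching error logs.
    def mapper(self, line):
        user, track, artist, timestamp = line.strip().split("\t")
        yield artist, 1

    def reducer(self, artist, counts):
        yield artist, sum(counts)
```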
Luigi does not have 999 features
Instead, the focus is on ridiculously little boilerplate code
General enough that you can build whatever you want on top of it
As well as a rapid experimentation cycle
Once things work, it’s trivial to put them in production
What we use Luigi for:
Hadoop Streaming
Java Hadoop MapReduce
Hive
Pig
Train machine learning models
Import/export data to/from Postgres
Insert data into Cassandra
scp/rsync/ftp data files and reports
Dump and load databases
Others are using it with Scala MapReduce and MRJob as well
Be one of the cool kids!
Luigi is open source
https://github.com/spotify/luigi
Originated at Spotify
Mainly built by me and Elias Freider
Based on many years of experience with data processing
Open source since September 2012