2 What is Luigi? Luigi is a workflow engine. If you run 10,000+ Hadoop jobs every day, you need one. If you play around with batch processing just for fun, you want one. It doesn’t help you with the code; that’s what Scalding, Pig, or anything else is good at. It helps you with the plumbing of connecting lots of tasks into complicated pipelines, especially if those tasks run on Hadoop.
3 What do we use it for? Music recommendations, A/B testing, top lists, ad targeting, label reporting, dashboards… and a million other things!
4 Currently running 10,000+ Hadoop jobs every day. On average, a Hadoop job is launched every 10s. There are 2,000+ Luigi tasks in production.
5 Some history … let’s go back to 2008!
6 The year was 2008. I was writing my master’s thesis about music recommendations and had to run hundreds of long-running tasks to compute the output.
7 Toy example: classify skipped tracks. [Diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]
$ python subsample_extract_features.py /log/endsongcleaned/2011-01-?? /tmp/subsampled
$ python train_model.py /tmp/subsampled model.pickle
$ python inspect_model.py model.pickle
8 Reproducibility matters… and automation. The previous code is really hard to run again.
9 Let’s make it into a big workflow.
$ python run_everything.py
10 Reality: crashes will happen. How do you resume this?
11 Ability to resume matters. When you are developing something interactively, you will try and fail a lot. Failures will happen, and you want to resume once you have fixed them. You want the system to figure out exactly what it has to re-run and nothing else. Atomic file operations are crucial for the ability to resume.
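The standard trick behind atomic outputs is to write to a temporary file and rename it into place only once the write has finished; a rename is atomic within one filesystem on POSIX. A minimal sketch of the idea (this is not Luigi code; Luigi's file targets implement the same pattern for you):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path atomically: readers either see the old
    file or the complete new file, never a half-written one."""
    # Create the temp file in the destination directory so the final
    # rename stays on the same filesystem (where rename is atomic).
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic replace on POSIX
    except BaseException:
        os.unlink(tmp_path)  # crash mid-write: leave no partial output
        raise
```

If a task crashes mid-write, no output file appears, so a rerun knows it still has to redo exactly this task and nothing else.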
12 So let’s make it possible to resume
13 But there are still annoying parts: hardcoded junk.
14 Generalization matters. You should be able to re-run your entire pipeline with a new value for a parameter. Command line integration means you can run interactive experiments.
15 … now we’re getting something!
$ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
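What might the argument handling inside run_everything.py look like? A minimal sketch using argparse; the flag names are taken from the command line above, while the defaults and the function itself are invented for illustration:

```python
import argparse

def parse_args(argv=None):
    # Flag names mirror the command line on the slide; the
    # defaults here are illustrative only.
    parser = argparse.ArgumentParser(description="Run the whole pipeline")
    parser.add_argument("--date-first", required=True,
                        help="first log date to process (YYYY-MM-DD)")
    parser.add_argument("--date-last", required=True,
                        help="last log date to process (YYYY-MM-DD)")
    parser.add_argument("--n-trees", type=int, default=100,
                        help="number of trees in the classifier")
    return parser.parse_args(argv)
```

Every new parameter means another add_argument call plus threading the value through the whole script, which is exactly the boilerplate the next slide complains about.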
16 … but it’s hardly readable: BOILERPLATE
17 Boilerplate matters! We keep re-implementing the same functionality. Let’s factor it out into a framework.
18 A lot of real-world data pipelines are a lot more complex. The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra).
19 So I started thinking: I wanted to build something like GNU Make.
20 What is Make and why is it pretty cool? Build reusable rules. Specify what you want to build, and then backtrack to find out what you need in order to get there. Reproducible runs.
# the compiler: gcc for C program, define as g++ for C++
CC = gcc
# compiler flags:
#   -g     adds debugging information to the executable file
#   -Wall  turns on most, but not all, compiler warnings
CFLAGS = -g -Wall
# the build target executable:
TARGET = myprog
all: $(TARGET)
$(TARGET): $(TARGET).c
	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c
clean:
21 We want something that works for a wide range of systems. We need to support lots of systems: “80% of data science is data munging”.
22 Data processing needs to interact with lots of systems. Need to support practically any type of task:
- Ingest into Cassandra
- SCP file somewhere else
23 My first attempt: builder Use XML config to build up the dependency graph!
24 Don’t use XML … seriously, don’t use it
25 Dependencies need code. Pipelines deployed in production often have nontrivial ways they define dependencies between tasks:
- Date algebra: Toplist(date_interval=2014-01) depends on Log(date=2014-01-01), Log(date=2014-01-02), …, Log(date=2014-01-31)
- Recursion (and date algebra): BloomFilter(date=2014-05-01) depends on BloomFilter(date=2014-04-30) and Log(date=2014-04-30), which in turn depends on Log(date=2014-04-29), and so on
- Enum types: IdToIdMap(from_type=artist, to_type=track) depends on IdMap(type=artist) and IdMap(type=track)
… and many other cases
26 Don’t ever invent your own DSL. “It’s better to write domain-specific code in a general-purpose language than to write general-purpose code in a domain-specific language” – unknown author. Oozie is a good example of how messy it gets.
27 2009: builder2. Solved all the things I just mentioned:
- Dependency graph specified in Python
- Support for arbitrary tasks
- Error emails
- Support for lots of common data plumbing stuff: Hadoop jobs, Postgres, etc.
- Lots of other things :)
29 More graphs!
30 Even more graphs!
31 What were the good bits? Build up dependency graphs and visualize them. Non-event to go from development to deployment. Built-in HDFS integration, but decoupled from the core library. What went wrong? Still too much boilerplate. Pretty bad command line integration.
33 Introducing Luigi. A workflow engine in Python.
34 Luigi – history at Spotify. Late 2011: Elias Freider and I built it and released it into the wild at Spotify, and people started using it (“the Python era”). Late 2012: open sourced it. Early 2013: first known company using it outside of Spotify: Foursquare.
35 Luigi is your friendly plumber. Simple dependency definitions. Emphasis on Hadoop/HDFS integration. Atomic file operations. Data flow visualization. Command line integration.
36 Luigi Task
37 Luigi Task – breakdown. The business logic of the task; where it writes output; what other tasks it depends on; parameters for this task.
38 Easy command line integration. So easy that you want to use Luigi for it.
$ python my_task.py MyTask --param 43
INFO: Scheduled MyTask(param=43)
INFO: Scheduled SomeOtherTask(param=43)
INFO: Done scheduling tasks
INFO: [pid 20235] Running SomeOtherTask(param=43)
INFO: [pid 20235] Done SomeOtherTask(param=43)
INFO: [pid 20235] Running MyTask(param=43)
INFO: [pid 20235] Done MyTask(param=43)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$ cat /tmp/foo/bar-43.txt
hello, world
$
39 Let’s go back to the example. [Diagram: Log d, Log d+1, …, Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output]
40 Code in Luigi
41 Extract the features
42 Run on the command line
$ python demo.py SubsampleFeatures --date-interval 2013-11-01
DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
...
INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0%
INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0%
...
INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87%
INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100%
INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$
45 Let’s make it more complicated – cross validation. [Diagram: as before, Log d … Log d+k-1 are subsampled into features that train a classifier; a second chain subsamples Log e … Log e+k-1 into held-out features; a Cross validation step combines the classifier with the held-out features]
51 Minimal boilerplate. Overhead for a task is about 5 lines (class def + requires + output + run). Easy command line integration.
52 Everything is a directed acyclic graph. Makefile style: tasks specify what they depend on, not what other things depend on them.
53 Luigi’s visualizer
54 Dive into any task
55 Run with multiple workers $ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08
56 Error notifications
57 Process synchronization. Prevents the same task from being run simultaneously, but all execution is done by the workers. [Diagram: a Luigi central planner coordinates Luigi worker 1 and Luigi worker 2, which run overlapping task graphs (A, B, C, F) without duplicating work]
58 Luigi is a way of coordinating lots of different tasks … but you still have to figure out how to implement and scale them!
59 Do general-purpose stuff. Don’t focus on a specific platform… but Luigi comes “batteries included”.
60 Built-in support for HDFS & Hadoop. At Spotify we’re abandoning Python for batch processing tasks, replacing it with Crunch and Scalding; Luigi is a great glue! Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi tasks in production. Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, importing into Cassandra, importing into Postgres, sending email reports, etc.
61 The one time we accidentally deleted 50TB of data. We didn’t have to write a single line of code to fix it – Luigi rescheduled thousands of tasks and ran them for 3 days.
62 Some things are still not perfect
63 The missing parts. Execution is tied to scheduling – you can’t schedule something to run “in the cloud”. Visualization could be a lot more useful. There’s no built-in time-based scheduling – you have to rely on crontab. These are all things we have in the backlog.
64 What are some ideas for the future?
65 Separate scheduling and execution. [Diagram: a Luigi central scheduler dispatching work to many slaves]
66 Luigi in Scala?
67 Luigi implements some core beliefs. The #1 focus is on removing all boilerplate. The #2 focus is to be as general as possible. The #3 focus is to make it easy to go from test to production.