(Big) Data At Spotify Adam Kawa Data Engineer @ Spotify
At Spotify, important questions are being asked all the time!
Some of these questions are ”relatively easy” to answer…
Labels, Licensors, Partners, Advertisers 1. How many times has Coldplay been streamed this month? 2. Who was the most popular artist in NYC last week? 3. How many times was “Get Lucky” streamed during its first 24 hours?
Reporting ■ Very granular reports are required - Divided by gender, age, location and more ■ We have been delivering various reports from day 1 - Too much data for traditional solutions
Popular Artists Question Who was the most frequently streamed female artist in 2013? Answer? A) Katy Perry B) Lady Gaga C) Madonna D) Rihanna
Popular Artists Question Who was the most frequently streamed female artist in 2013?
Popular Artists In 2013 ■ The Most Popular Male Artist - Macklemore ■ The Most Popular Band - Imagine Dragons ■ The Most Popular Track - “Can't Hold Us”
Popular Artists In 2013 ■ Users love local artists! - Berlin - Sido - London - Coldplay - Singapore – Vanessa-Mae - Stockholm - Avicii
Popular Artists In 2013 ■ Users love local artists! - NYC listens to Jay-Z 88% more than the rest of the world - Stockholm listens to ABBA 110% more than the rest of the world
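A figure such as “88% more than the rest of the world” can be read as a ratio of stream shares. The sketch below is one plausible definition, not necessarily the one used for these slides; the function name and all counts are hypothetical.

```python
def relative_lift(local_artist_streams, local_total_streams,
                  world_artist_streams, world_total_streams):
    """Return how much more (as a fraction) a city streams an artist
    compared with the rest of the world, based on stream shares."""
    # Share of the city's streams that belong to the artist.
    local_share = local_artist_streams / local_total_streams
    # Same share for everyone outside the city.
    rest_artist = world_artist_streams - local_artist_streams
    rest_total = world_total_streams - local_total_streams
    rest_share = rest_artist / rest_total
    return local_share / rest_share - 1.0


# Hypothetical counts: 1.88% of local streams vs 1.00% elsewhere.
lift = relative_lift(188, 10_000, 10_188, 1_010_000)
print(f"{lift:.0%} more than the rest of the world")
```

Comparing shares rather than raw counts keeps a large city from looking “more local” simply because it streams more of everything.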
Popular Tracks Question What was the most “viral” track in 2013?
Popular Tracks Question What was the most “viral” track in 2013? Answer “Get Lucky” by Daft Punk feat. Pharrell Williams
Artist Analytics – Daft Punk “Get Lucky” was released on April 19th, 2013.
Artist Analytics – Daft Punk Around 5x more streams when comparing a day before and a day after “Get Lucky”
Artist Analytics – Daft Punk What happened that day?
Artist Analytics – Daft Punk “Random Access Memories” was released on May 17th, 2013.
Artist Analytics – Whitney Houston ■ 09.08.1963 – 11.02.2012
Artist Analytics – Budka Suflera ■ One of the most popular Polish rock bands ever What happened?
Artist Analytics – Budka Suflera ■ One of the most popular Polish rock bands ever Information about the retirement was announced...
Management And Investors 1. What was the number of daily active users (DAU) yesterday? 2. How many users have signed up this week? 3. Which country to launch Spotify next?
Business Analytics ■ Analyzing growth - Number of active users, streamed songs, sign-ups and more - Where to launch Spotify next ■ Company KPIs
However, some of the questions are really tricky to answer!
Data Scientists, Researchers 1. What song to stream to Jay-Z when he wakes up? 2. Is Adam Kawa bored with Timbuktu today? 3. How to encourage Jeff to go for the Premium Account?
Product Features ■ Recommendations - Powering features like Discover, Radio - “Perfect music for every moment ♪♫ ♪♫ ♬ ♬ ♯” ■ Classification of songs and playlists by genre or mood ■ Top lists per country
Perfect Music For Every Moment ■ Overall, in 2013 - Best Hangover Cure - “The Lazy Song” - Best Song To Get Over An Ex - “Someone like you” - Best Party Starter - “Levels” - Best Driving Song – “Bohemian Rhapsody” - Best Work Out Song - “Eye of the Tiger”
Designers, Feature Owners 1. Is this button nicer than the previous one? 2. How to personalize the messages displayed to users? 3. How should the results of search be displayed?
Designers, Feature Owners ■ A/B Test - Come up with promising “look-and-feels” and do A/B tests ■ Explicit feedback from users - But users usually do not like to rate things - But users usually do not like to customize things
A/B Test Use Case ■ Sign-up Button On Facebook - Sign-up button on the landing page
Sign-up Button On Facebook Layouts of sign-up button A – Control Group (50%) B – Test Group (50%)
Sign-up Button On Facebook Layouts of sign-up button A – Control Group (50%) B – Test Group (50%) Which one performed better?
Sign-up Button On Facebook Layouts of sign-up button A – Control Group (50%) Much more sign-ups! B – Test Group (50%)
A/B Tests ■ “Only 10% are likely to cause a true uplift” - Google after 12K tests - Be able to iterate fast! ■ “80% of the time, we are wrong about what consumers want” - The truth is in data!
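Deciding whether a test group really “performed better” usually comes down to a significance check. The sketch below uses a standard two-proportion z-test; the slides do not say which test was used, and the sign-up counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion
    rates between control group A and test group B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical 50/50 split: 10.0% vs 11.5% sign-up rate.
z, p_value = two_proportion_z_test(1000, 10_000, 1150, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With most tests producing no true uplift, a check like this helps separate real wins from noise before iterating further.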
In the past, we guesstimated a bit (common sense, intuition, gut feeling, observations, inspirations)
“KöP!” means “BUY!” Isn't it inspired by the Windows Start Menu button? ;)
Today, we make data-driven decisions
To make data-driven decisions, data and data infrastructure are required (among other things)
Users At Spotify ■ Over 6 million paying subscribers ■ Over 24 million MAU (monthly active users) ■ 1.5 billion playlists created so far ■ Available in 55 countries ■ Over 20 million songs ■ 4.5 billion hours streamed in 2013
(Big) Data At Spotify ■ Data generated by users and for users! - 1.5 TB of compressed data from users per day - 64 TB of data generated in Hadoop each day (triplicated)
Data Infrastructure At Spotify ■ Apache Hadoop YARN ■ Many other systems including: - Kafka, Luigi, Cassandra, PostgreSQL in production - Giraph, Tez, Spark under evaluation
Apache Hadoop ■ Probably the largest commercial Hadoop cluster in Europe! - 694 heterogeneous nodes - 12.63 PB of data used - ~7,000 jobs each day
Apache Hadoop ■ Used for “off-line” processing - When Hadoop is down, Spotify still plays music! - When Hadoop is down, Data Analysts play FIFA, table tennis or … run queries locally ■ We mostly analyze logs from users' activity
What Does Hadoop Allow Us To Do? ■ Get insights to offer a better product - “More data usually beats better algorithms” ■ Get insights to make better decisions - Avoid “guesstimates” ■ Gain a competitive advantage - More companies have started offering music streaming
How Do We Use Hadoop? ■ We use multiple tools and languages - Hive is very popular among our data analysts - Crunch for core pipeline jobs - Some legacy code in Hadoop Streaming with Python - A number of Pig, Java MapReduce jobs - Avro as storage format (but we are starting to consider columnar formats)
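For readers unfamiliar with Hadoop Streaming, a Python job is a mapper and a reducer that exchange tab-separated lines. The sketch below counts streams per artist; the log layout (user, artist, track) is a hypothetical stand-in, not Spotify's actual format, and the shuffle step is simulated locally with `sorted`.

```python
import itertools


def mapper(lines):
    """Emit 'artist<TAB>1' for every well-formed log line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        yield f"{fields[1]}\t1"


def reducer(lines):
    """Sum the counts per artist; input must be sorted by key,
    as it would be after Hadoop's shuffle and sort phase."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for artist, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield f"{artist}\t{sum(int(count) for _, count in group)}"


# Hypothetical log lines: user_id, artist, track.
log = [
    "u1\tColdplay\tYellow",
    "u2\tDaft Punk\tGet Lucky",
    "u3\tColdplay\tClocks",
    "bad line",
]
mapped = sorted(mapper(log))  # local stand-in for shuffle and sort
result = list(reducer(mapped))
print(result)
```

On a cluster, the two functions would live in separate scripts that read `sys.stdin` and write to `sys.stdout`.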
Apache Kafka ■ Primarily used to transport logs - from multiple servers - to a central location for storage and analysis ■ A better fit for us than Flume - We got higher throughput with Kafka ■ We added more features to Kafka - End-to-end delivery - Encryption
Apache Cassandra ■ A scalable and distributed key-value store ■ Provides fast read-write access for many small pieces of data - We use it for playlists, user profiles, popularity count ■ Was a better fit for us than HBase - The NameNode (NN) was a single point of failure (SPOF) at that time
Luigi ■ Allows us to build complex pipelines of batch jobs ■ Handles dependency resolution, workflow management, visualization and more ■ Our alternative to Oozie and Azkaban - Spotify, Foursquare, Bitly and more contribute
RDBMS We still use them! ■ Powering features that require transaction support, integrity constraints - e.g. ordering Spotify gift-cards ■ Semi-aggregated data for dashboards ■ Semi-aggregated data for quick analysis
March 2013 Tricky questions were asked!
Finance Department 1. How many servers do you need to buy to survive one year? 2. If we agree, what will you do to use them efficiently? 3. If we agree, do not come back to us this year, OK?
Adam Kawa ■ Partially responsible for answering these questions! ■ One of the Data Engineers who - takes care of the 694-node Hadoop-YARN cluster - implements and troubleshoots users' jobs - works in a team with Josh, Marcin, Rafal, Fabian and Wouter ■ Hadoop instructor for almost 2 years ■ Co-organizer of Warsaw and Stockholm HUGs ■ Blogger at HakunaMapData.com
Operational Metrics ■ Latency analysis - msec to wait for music after pressing the “Play” button ■ Capacity planning - servers, bandwidth, data-center space and more
Operational Metrics For Hadoop ■ Hadoop provides tons of metrics, logs and files ■ They can be analyzed by … Hadoop
What Hadoop Can Tell About Itself ■ This knowledge can be useful to learn how to - measure how fast our HDFS is growing - calculate the empirical retention policy for datasets - optimize the scheduler - benchmark the cluster - and more
Let's see a couple of examples
5,000 TB of data created before October 1, 2013
Could we archive data accessed before this day?
Advanced HDFS Capacity Planning ■ You can analyze the FsImage file to learn how fast you grow ■ You can even correlate this data with - number of DAU - total size of logs generated by users - activity of users e.g. hours streamed - number of queries / day run by analysts
Simplified HDFS Capacity Planning ■ You can also use the “trend feature” in Ganglia If we do NOTHING, we will fill the cluster in September...
What will we do to survive longer than September?
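A “when will the cluster be full?” estimate can be approximated by fitting a straight line to daily HDFS usage and extrapolating. This is a minimal sketch under the assumption of linear growth; it is not how Ganglia computes its trend, and the usage numbers and capacity are hypothetical.

```python
def days_until_full(daily_used_tb, capacity_tb):
    """Fit a least-squares line to daily usage samples and return the
    number of days after the last sample until capacity is reached.
    Returns None when usage is not growing."""
    n = len(daily_used_tb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_used_tb) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_used_tb))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index where the fitted line crosses the capacity.
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - (n - 1))


# Hypothetical samples: 10 TB of growth per day, 1540 TB of capacity.
used = [1000, 1010, 1020, 1030, 1040]
remaining = days_until_full(used, 1540)
print(f"About {remaining:.0f} days left")
```

Correlating usage with DAU or hours streamed, as the advanced approach suggests, would replace the day index with those drivers as regression inputs.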
■ We introduced an automatic retention policy - An owner of the dataset specifies a retention period - If needed, a retention period can be calculated empirically
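One way to calculate a retention period empirically is to look at how old data was when it was read, and keep it just long enough to cover nearly all reads. The sketch below assumes that definition, which the slides do not spell out; the 99% coverage threshold and the access ages are hypothetical.

```python
def empirical_retention_days(access_ages_days, coverage_percent=99):
    """Return the smallest age (in days) that covers the given percentage
    of observed reads, using the nearest-rank percentile."""
    if not access_ages_days:
        raise ValueError("no access records for this dataset")
    ordered = sorted(access_ages_days)
    # Nearest-rank: ceil(coverage_percent * n / 100) in integer arithmetic.
    rank = (coverage_percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical ages (days since creation) at the time of each read.
ages = [0] * 50 + [1] * 30 + [7] * 15 + [30] * 4 + [400]
retention = empirical_retention_days(ages)
print(f"Keep this dataset for {retention} days")
```

A percentile rather than the maximum keeps one stray read of very old data from forcing the whole dataset to be retained indefinitely.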
We continuously improve our MapReduce jobs
Recurring MapReduce Jobs ■ We schedule some jobs each hour, day or week e.g.: - Top lists for each country - Reports for the labels, partners, advertisers Idea ■ Use job statistics from the previous executions of a job - to optimize the current execution of this job - to learn about the history of performance of a given job Even a perfect manual setting may eventually become outdated when an input dataset grows!
MapReduce Jobs Autotuning
■ A tiny PoC ;)
■ The average task time set to 10 minutes (inspired by LinkedIn)

type   # map  # reduce  avg map time  avg reduce time      job execution time
old_1  4826   25        46sec         1hrs, 52mins, 14sec  2hrs, 52mins, 16sec
new_1  391    294       4mins, 46sec  8mins, 24sec         23mins, 12sec

type   # map  # reduce  avg map time  avg reduce time      job execution time
old_2  4936   800       7mins, 30sec  22mins, 18sec        5hrs, 20mins, 1sec
new_2  4936   1893      8mins, 52sec  7mins, 35sec         1hrs, 18mins, 29sec

■ It should help in extreme cases: very short-lived and very long-lived tasks
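The idea of reusing statistics from a previous run can be sketched as follows: treat the total task time of the last execution as fixed work and divide it by the 10-minute target. This is an assumed heuristic, not the PoC's actual code, and it ignores per-task overhead and data skew; the function name is hypothetical, while the inputs come from the old_1 row above.

```python
import math

TARGET_TASK_SECONDS = 10 * 60  # the 10-minute average task time target


def suggest_task_count(prev_tasks, prev_avg_task_seconds,
                       target_seconds=TARGET_TASK_SECONDS):
    """Suggest a task count so that the average task takes roughly
    target_seconds, assuming total work equals the previous run's."""
    total_work_seconds = prev_tasks * prev_avg_task_seconds
    return max(1, math.ceil(total_work_seconds / target_seconds))


# Statistics of old_1: 4826 maps at 46sec, 25 reduces at 1hrs, 52mins, 14sec.
suggested_maps = suggest_task_count(4826, 46)
suggested_reduces = suggest_task_count(25, 1 * 3600 + 52 * 60 + 14)
print(suggested_maps, suggested_reduces)
```

The suggestions land in the same range as the new_1 row (391 maps, 294 reduces) without matching it exactly, which is expected because average task time also changes with the number of tasks.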
Summary ■ We make data-driven decisions to improve our product ■ Scalable, open-source projects allow us to do that ■ Hadoop, Cassandra, Kafka need love and care - And passionate people who give it to them ■ Hadoop is like a salutary virus - It quickly spreads across people and projects
One Question: What could happen after some time of simultaneous development of MapReduce jobs, maintenance of a large cluster, and listening to perfect music for every moment?
A Possible Answer: You may discover Hadoop in the lyrics of many popular songs!