During one of our epic parties, Martin Lorentzon (chairman of Spotify) agreed to help me arrange a dinner with Timbuktu (my favourite Swedish rap and reggae artist) if I could somehow prove that I am his biggest fan in my home country. Because at Spotify we attack all problems with data-driven approaches, I decided to implement a Hive query that processes real datasets to figure out who streams Timbuktu most frequently in my country. Although the problem seems well-defined, implementing this query efficiently poses many challenges: sampling, testing, debugging, troubleshooting, optimizing, and executing it over terabytes of data on a Hadoop YARN cluster with hundreds of nodes. During my talk, I will describe all of them and share how to increase your (and the cluster’s) productivity by following tips and best practices for analyzing large datasets with Hive on YARN. I will also explain how features recently added to Hive (e.g. join optimizations, the ORC file format, and the upcoming Tez integration) can be used to make your query extremely fast.