At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our p...
At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, when to buy new nodes, how to calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets) and more.