This page reproduces the content of http://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify.

Uploaded just under 3 years ago (2014/01/13) in Technology

In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small-scale version using Python, NumPy, and SciPy, and how to scale up to 24 Million users and 20 Million songs using Hadoop and Spark.

- Algorithmic Music Discovery at Spotify

Chris Johnson

@MrChrisJohnson

January 13, 2014

- Who am I??

• Chris Johnson

– Machine Learning guy from NYC

– Focused on music recommendations

– Formerly a graduate student at UT Austin

- What is Spotify?

• On-demand music streaming service

• “iTunes in the cloud”

- Data at Spotify...

• 20 Million songs

• 24 Million active users

• 6 Million paying users

• 8 Million daily active users

• 1 TB of compressed data generated from users per day

• 700 node Hadoop Cluster

• 1 Million years' worth of music streamed

• 1 Billion user-generated playlists

- Challenge: 20 Million songs... how do we recommend music to users?

- Recommendation Features

• Discover (personalized recommendations)

• Radio

• Related Artists

• Now Playing

- How can we find good recommendations?

• Manual Curation

• Manually Tag Attributes

• Audio Content, Metadata, Text Analysis

• Collaborative Filtering

- Collaborative Filtering - “The Netflix Prize”

- Collaborative Filtering

Hey, I like tracks P, Q, R, S!

Well, I like tracks Q, R, S, T!

Then you should check out track P!

Nice! Btw try track T!

Image via Erik Bernhardsson
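The exchange above can be sketched in a few lines of Python. This is only a toy illustration of the idea: the listener names and the `recommend` function are hypothetical, and using the number of shared tracks as the similarity is an assumption made for simplicity.

```python
from collections import Counter

# Hypothetical listening histories matching the dialogue above.
likes = {
    "listener_a": {"P", "Q", "R", "S"},
    "listener_b": {"Q", "R", "S", "T"},
}


def recommend(user, likes):
    """Suggest tracks liked by overlapping users that `user` has not liked yet."""
    scores = Counter()
    for other, tracks in likes.items():
        if other == user:
            continue
        overlap = len(likes[user] & tracks)  # shared tracks act as the similarity
        if overlap == 0:
            continue  # no taste in common, so ignore this listener
        for track in tracks - likes[user]:
            scores[track] += overlap
    # Highest-scoring tracks first.
    return [track for track, _ in scores.most_common()]


print(recommend("listener_a", likes))  # ['T']
print(recommend("listener_b", likes))  # ['P']
```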

- Difference between movie and music recs

• Scale of catalog: 60,000 movies vs. 20,000,000 songs

• Repeated consumption

• Music is more niche

- “The Netflix Problem” vs. “The Spotify Problem”

• Netflix: Users explicitly “rate” movies

• Spotify: Feedback is implicit through streaming behavior

- Explicit Matrix Factorization

• Users explicitly rate a subset of the movie catalog

• Goal: predict how users will rate new movies

[Figure: a users × movies ratings matrix, with the user “Chris” and the movie “Inception” labeled]

- Explicit Matrix Factorization

• Approximate the ratings matrix by the product of low-dimensional user and movie matrices

• Minimize RMSE (root mean squared error)
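For reference, one common way to write this objective, using the quantities this slide lists (rating, user and item biases, latent factor vectors, regularization parameter), is sketched below. The symbol names ($r_{ui}$, $x_u$, $y_i$, $\beta_u$, $\beta_i$, $\lambda$) are assumptions chosen for illustration, and the exact form on the original slide may differ. The sum runs only over the ratings that users actually gave.

```latex
\min_{x,\,y,\,\beta}\;
\sum_{(u,i)\ \text{observed}}
\left( r_{ui} - x_u^{T} y_i - \beta_u - \beta_i \right)^2
\;+\; \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

Apart from the regularization term, minimizing this sum of squared errors is the same as minimizing RMSE over the observed ratings.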

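[Figure: a partially observed ratings matrix, with rows such as “? 3 5 ?”, “1 ? ? 1”, “2 ? 3 2”, “? ? ? 5” and “5 2 ? 4” (“?” = not rated), approximated by the product of a user matrix X and a movie matrix Y; the user “Chris” and the movie “Inception” are labeled]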

• $r_{ui}$ = user $u$'s rating for movie $i$

• $\beta_u$ = bias for user $u$

• $x_u$ = user $u$'s latent factor vector

• $\beta_i$ = bias for item $i$

• $y_i$ = item $i$'s latent factor vector

• $\lambda$ = regularization parameter

- Implicit Matrix Factorization

• Replace stream counts with binary labels

– 1 = streamed, 0 = never streamed

• Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights
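The weighted objective described in the bullets above can be written, under some assumptions, as follows. Here $p_{ui}$ is the binary label, $c_{ui}$ is the weight, and $r_{ui}$ denotes the raw stream count. The linear weight $c_{ui} = 1 + \alpha r_{ui}$ is one common choice from the implicit-feedback literature and is assumed here rather than taken from the slide, as is the symbol $\alpha$.

```latex
\min_{x,\,y,\,\beta}\;
\sum_{u,i} c_{ui}
\left( p_{ui} - x_u^{T} y_i - \beta_u - \beta_i \right)^2
\;+\; \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
```

Unlike the explicit case, the sum runs over every user-track pair, including the zeros, which is why the sparsity of the data has to be handled carefully when solving.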

[Figure: a binary users × tracks matrix, with rows such as “1 0 0 0 1 0 0 1”, “0 0 1 0 0 1 0 0” and “1 0 1 0 0 0 1 1”, approximated by the product of a user matrix X and a track matrix Y]

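• $p_{ui}$ = 1 if user $u$ streamed track $i$, else 0

• $c_{ui}$ = weight for user $u$ and track $i$, a function of the stream count

• $x_u$ = user $u$'s latent factor vector

• $y_i$ = item $i$'s latent factor vector

• $\beta_u$ = bias for user $u$

• $\beta_i$ = bias for item $i$

• $\lambda$ = regularization parameter

- Alternating Least Squares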

• Initialize user and item vectors to random noise

• Fix item vectors and solve for optimal user vectors

– Take the derivative of the loss function with respect to the user’s vector, set it equal to 0, and solve

– Results in a system of linear equations with a closed-form solution!

• Fix user vectors and solve for optimal item vectors

• Repeat until convergence
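The steps above can be sketched with dense NumPy arrays. This is a minimal toy version under stated assumptions, not the code from the repository linked on this slide: it omits the bias terms, assumes the weight function `1 + alpha * count`, and runs a fixed number of iterations instead of testing for convergence. The names `implicit_als`, `weighted_loss`, `alpha` and `reg` are hypothetical.

```python
import numpy as np


def implicit_als(counts, n_factors=2, alpha=40.0, reg=0.1, n_iters=10, seed=0):
    """Dense toy ALS for implicit feedback (no bias terms).

    counts: (n_users, n_items) array of stream counts.
    Returns user factors X (n_users, f) and item factors Y (n_items, f).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = counts.shape
    P = (counts > 0).astype(float)   # binary labels: 1 = streamed
    C = 1.0 + alpha * counts         # assumed weight function
    # Initialize both sides to small random noise.
    X = 0.01 * rng.standard_normal((n_users, n_factors))
    Y = 0.01 * rng.standard_normal((n_items, n_factors))
    lam_eye = reg * np.eye(n_factors)

    def solve_side(fixed, C_rows, P_rows):
        # One weighted ridge regression per row: (F^T C F + reg*I) v = F^T C p.
        out = np.empty((C_rows.shape[0], n_factors))
        for k in range(C_rows.shape[0]):
            weighted = C_rows[k][:, None] * fixed
            A = fixed.T @ weighted + lam_eye
            b = fixed.T @ (C_rows[k] * P_rows[k])
            out[k] = np.linalg.solve(A, b)
        return out

    for _ in range(n_iters):
        X = solve_side(Y, C, P)      # fix item vectors, solve for users
        Y = solve_side(X, C.T, P.T)  # fix user vectors, solve for items
    return X, Y


def weighted_loss(counts, X, Y, alpha=40.0, reg=0.1):
    """Weighted squared error plus L2 penalty, the quantity ALS decreases."""
    P = (counts > 0).astype(float)
    C = 1.0 + alpha * counts
    err = P - X @ Y.T
    return float((C * err ** 2).sum() + reg * ((X ** 2).sum() + (Y ** 2).sum()))


counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 4],
                   [1, 0, 5, 0]], dtype=float)
X, Y = implicit_als(counts)
print(weighted_loss(counts, X, Y))
```

Because each half-step solves its least-squares subproblem exactly, the loss cannot increase from one iteration to the next, which gives a simple sanity check when experimenting.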

code: https://github.com/MrChrisJohnson/implicitMF

- Alternating Least Squares

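• Note that: $Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y$, where $Y$ is the matrix of item vectors and $C^u$ is the diagonal matrix of user $u$'s weights

• Then, we can pre-compute $Y^T Y$ once per iteration

– $C^u - I$ and $p(u)$ (the vector of user $u$'s binary labels) only contain non-zero elements for tracks that the user streamed

– Using sparse matrix operations we can then compute each user’s vector efficiently in $O(f^2 n_u + f^3)$ time, where $f$ is the number of latent factors and $n_u$ is the number of tracks the user streamed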
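The pre-computation described above can be sketched as a per-user solve that touches only the tracks that user streamed. This is an illustrative sketch, not the repository's code: it assumes no bias terms, a weight of `1 + alpha * count`, and a SciPy CSR matrix that stores only the non-zero stream counts. The function name `solve_user_sparse` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def solve_user_sparse(counts_csr, Y, YtY, u, alpha=40.0, reg=0.1):
    """Solve for one user's vector using only the tracks that user streamed."""
    start, end = counts_csr.indptr[u], counts_csr.indptr[u + 1]
    items = counts_csr.indices[start:end]  # tracks streamed by user u
    r_u = counts_csr.data[start:end]       # their stream counts
    Y_u = Y[items]                         # (n_u, f) rows of Y
    extra = alpha * r_u                    # diagonal of C^u - I on those tracks
    # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; the second term needs only Y_u.
    A = YtY + Y_u.T @ (extra[:, None] * Y_u) + reg * np.eye(Y.shape[1])
    # Y^T C^u p(u): p(u) is 1 on streamed tracks and 0 elsewhere.
    b = Y_u.T @ (1.0 + extra)
    return np.linalg.solve(A, b)


counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 4],
                   [1, 0, 5, 0]], dtype=float)
rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 2))  # stand-in item vectors
YtY = Y.T @ Y                    # computed once per iteration
x0 = solve_user_sparse(csr_matrix(counts), Y, YtY, u=0)
print(x0)
```

The cost per user is dominated by the `Y_u.T @ ...` product over the streamed tracks and the small `f × f` solve, which matches the per-user cost stated above.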

code: https://github.com/MrChrisJohnson/implicitMF

- How do we use the learned vectors?

• User-Item score is the dot product

• Item-Item similarity is the cosine similarity

• Both operations are cheap: their cost depends only on the number of latent factors
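Both operations are a line or two of NumPy. The sketch below uses made-up two-dimensional vectors; the function names are hypothetical, and the cosine similarity assumes neither vector is all zeros.

```python
import numpy as np


def user_item_score(x_u, y_i):
    """Predicted affinity of a user for an item: the dot product."""
    return float(np.dot(x_u, y_i))


def item_item_similarity(y_i, y_j):
    """Cosine similarity between two item vectors (both assumed non-zero)."""
    return float(np.dot(y_i, y_j) / (np.linalg.norm(y_i) * np.linalg.norm(y_j)))


x_u = np.array([0.5, 1.0])  # made-up user vector
y_a = np.array([1.0, 2.0])  # made-up item vectors
y_b = np.array([2.0, 4.0])

print(user_item_score(x_u, y_a))       # 2.5
print(item_item_similarity(y_a, y_b))  # parallel vectors, so approximately 1.0
```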

- Latent Factor Vectors in 2 dimensions

- Scaling up Implicit Matrix Factorization with Hadoop

- Hadoop at Spotify 2009

- Hadoop at Spotify 2014

700 Nodes in our London data center

- Implicit Matrix Factorization with Hadoop

[Figure: map step and reduce step. All log entries are split into a K × L grid of blocks by u % K (0, 1, ..., K-1) and i % L (0, 1, ..., L-1); the user vectors are grouped by u % K and the item vectors by i % L.]

Figure via Erik Bernhardsson

- Implicit Matrix Factorization with Hadoop

One map task:

• Map input: tuples (u, i, count) where u % K = x and i % L = y

• Distributed cache: all user vectors where u % K = x

• Distributed cache: all item vectors where i % L = y

• Mapper: emit contributions

• Reducer: new vector!
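The blocking scheme above can be simulated in a single process to see how the pieces fit together. This is a sketch under stated assumptions, not Spotify's Hadoop job: the form of the emitted "contributions" (each tuple's share of a user's normal equations), the weight `1 + alpha * count`, the absence of bias terms, and all function names are assumptions made for illustration. Only the user-vector update is shown; the item update would be symmetric.

```python
from collections import defaultdict

import numpy as np


def partition(tuples, K, L):
    """Group (u, i, count) log tuples into blocks keyed by (u % K, i % L)."""
    blocks = defaultdict(list)
    for u, i, count in tuples:
        blocks[(u % K, i % L)].append((u, i, count))
    return blocks


def map_block(block, Y, alpha):
    """One map task: emit each tuple's contribution, keyed by user."""
    for u, i, count in block:
        y_i = Y[i]  # a real task would read this from its cached item vectors
        yield u, (alpha * count * np.outer(y_i, y_i), (1.0 + alpha * count) * y_i)


def reduce_user(contribs, YtY, reg):
    """Reducer: sum one user's contributions and solve for the new vector."""
    A = YtY + reg * np.eye(YtY.shape[0]) + sum(a for a, _ in contribs)
    b = sum(v for _, v in contribs)
    return np.linalg.solve(A, b)


def update_users(tuples, Y, K=2, L=2, alpha=40.0, reg=0.1):
    YtY = Y.T @ Y                # computed once and shared by all reducers
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for block in partition(tuples, K, L).values():
        for u, contrib in map_block(block, Y, alpha):
            grouped[u].append(contrib)
    return {u: reduce_user(c, YtY, reg) for u, c in grouped.items()}


# Made-up log tuples (user, item, stream count) and stand-in item vectors.
tuples = [(0, 0, 3.0), (0, 2, 1.0), (1, 1, 2.0), (1, 3, 4.0), (2, 0, 1.0), (2, 2, 5.0)]
Y = np.random.default_rng(0).standard_normal((4, 2))
new_user_vectors = update_users(tuples, Y)
print(sorted(new_user_vectors))  # [0, 1, 2]
```

Because a user's tuples are spread over L blocks, the reducer has to sum contributions from several map tasks before it can solve for that user's vector.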

Figure via Erik Bernhardsson

- Implicit Matrix Factorization with Spark

Spark vs. Hadoop

http://www.slideshare.net/Hadoop_Summit/spark-and-shark

- Ensemble of Latent Factor Models

Figure via Erik Bernhardsson

- A/B Testing Recommendations

- Open Problems

• How to go from predictive model to related artists? (learning to rank?)

• How do you learn from user feedback?

• How do you deal with observation bias in the user feedback? (active learning?)

• How to factor in temporal information?

• How much value in content-based recommendations?

• How to best evaluate model performance?

• How to best train an ensemble?

- Thank You!
