Source: http://www.slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark

Uploaded 2014/05/09, in Technology.

Spotify uses a range of machine learning models to power its music recommendation features, including the Discover page and Radio. Due to the iterative nature of training these models, they suffer from the I/O overhead of Hadoop and are a natural fit for the Spark programming paradigm. In this talk I will present both the right way and the wrong way to implement collaborative filtering models with Spark. Additionally, I will take a deep dive into how matrix factorization is implemented in the MLlib library.

Who am I??

• Chris Johnson
  – Machine learning guy from NYC
  – Focused on music recommendations
  – Formerly a graduate student at UT Austin

What is MLlib?

Algorithms:
• classification: logistic regression, linear support vector machine (SVM), Naive Bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
• collaborative filtering: alternating least squares (ALS)

http://spark.apache.org/docs/0.9.0/mllib-guide.html

Collaborative Filtering - “The Netflix Prize”
Collaborative Filtering

“Hey, I like tracks P, Q, R, S!”
“Well, I like tracks Q, R, S, T!”
“Then you should check out track P!”
“Nice! Btw try track T!”

Image via Erik Bernhardsson

Collaborative Filtering at Spotify

• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing

Explicit Matrix Factorization

• Users explicitly rate a subset of the movie catalog
• Goal: predict how users will rate new movies

[Figure: a users × movies ratings matrix; e.g. user “Chris” and the movie “Inception”]

Explicit Matrix Factorization

• Approximate the ratings matrix by the product of low-dimensional user and movie matrices
• Minimize RMSE (root mean squared error)

[Figure: a partially observed ratings matrix (entries such as “? 3 5 ?”) approximated by the product X · Y]

min over X, Y of:  Σ_(u,i observed) (r_ui − x_u·y_i − b_u − b_i)² + λ(Σ_u ‖x_u‖² + Σ_i ‖y_i‖²)

• r_ui = user u’s rating for movie i
• b_u = bias for user u
• x_u = user latent factor vector
• b_i = bias for item i
• y_i = item latent factor vector
• λ = regularization parameter

Implicit Matrix Factorization

• Replace stream counts with binary labels
  – 1 = streamed, 0 = never streamed
• Minimize weighted RMSE (root mean squared error) using a function of stream counts as weights

[Figure: a binary play matrix (rows of 0s and 1s) approximated by the product X · Y]

min over X, Y of:  Σ_(u,i) c_ui (p_ui − x_u·y_i − b_u − b_i)² + λ(Σ_u ‖x_u‖² + Σ_i ‖y_i‖²)

• p_ui = 1 if user u streamed track i, else 0
• b_u = bias for user u
• c_ui = confidence weight, a function of user u’s stream count for track i
• b_i = bias for item i
• x_u = user latent factor vector
• λ = regularization parameter
• y_i = item latent factor vector

Alternating Least Squares

• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
  – Take the derivative of the loss function with respect to the user’s vector, set it equal to 0, and solve
  – This results in a system of linear equations with a closed-form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence
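The loop above can be sketched in a few lines of numpy. This is only an illustration of the alternation, not the talk’s implementation: it uses a small dense, fully observed toy matrix R, illustrative values for the rank f and regularization lam, and omits biases and confidence weights.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy fully observed ratings matrix (real data is sparse / partially observed).
R = rng.integers(1, 6, size=(8, 6)).astype(float)
n_users, n_items = R.shape
f, lam = 3, 0.1                          # latent factors and regularization (illustrative)

# Initialize user and item vectors to random noise.
X = rng.normal(scale=0.1, size=(n_users, f))
Y = rng.normal(scale=0.1, size=(n_items, f))

def loss(X, Y):
    return np.sum((R - X @ Y.T) ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))

losses = [loss(X, Y)]
for _ in range(10):
    # Fix item vectors; the normal equations give the closed-form user solve:
    #   x_u = (YᵀY + λI)⁻¹ Yᵀ r_u
    X = np.linalg.solve(Y.T @ Y + lam * np.eye(f), Y.T @ R.T).T
    # Fix user vectors; solve for item vectors symmetrically.
    Y = np.linalg.solve(X.T @ X + lam * np.eye(f), X.T @ R).T
    losses.append(loss(X, Y))

print(losses[0], "->", losses[-1])
```

Because each half-step exactly minimizes the loss with respect to one factor matrix, the loss never increases from one iteration to the next, which is the sense in which “repeat until convergence” is safe.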

code: https://github.com/MrChrisJohnson/implicitMF

Alternating Least Squares

• Note that:  YᵀC_u Y = YᵀY + Yᵀ(C_u − I)Y
• Then, we can pre-compute YᵀY once per iteration
  – (C_u − I) and p_u only contain non-zero elements for tracks that the user streamed
  – Using sparse matrix operations we can then compute each user’s vector efficiently in O(n_u·f² + f³) time, where n_u is the number of tracks the user streamed and f is the number of latent factors
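The identity above can be checked numerically. The sketch below is illustrative (toy sizes, a made-up set of streamed tracks, and the common confidence weighting c_ui = 1 + α·count with α = 40, which is an assumption rather than a value from the slide): it confirms that the dense YᵀC_uY equals the precomputed YᵀY plus a correction built from only the streamed rows.

```python
import numpy as np

rng = np.random.default_rng(1)

n_items, f = 1000, 10
Y = rng.normal(size=(n_items, f))
YtY = Y.T @ Y                        # pre-computed once per iteration, shared by all users

# A user who streamed only 3 tracks (indices and counts are illustrative).
streamed = np.array([17, 423, 980])
counts = np.array([5.0, 2.0, 9.0])
alpha = 40.0

# Dense version: YᵀC_uY, where C_u is the identity plus alpha*count on streamed rows.
c = np.ones(n_items)
c[streamed] += alpha * counts
dense = Y.T @ np.diag(c) @ Y

# Sparse version: YᵀY + Yᵀ(C_u − I)Y touches only the 3 streamed rows,
# i.e. O(n_u·f²) work instead of O(n·f²).
Ys = Y[streamed]                                     # (3, f)
sparse = YtY + Ys.T @ ((alpha * counts)[:, None] * Ys)

print(np.allclose(dense, sparse))
```

The per-user cost is then n_u rank-one updates plus one f×f linear solve, which is where the O(n_u·f² + f³) bound comes from.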

code: https://github.com/MrChrisJohnson/implicitMF

Scaling up Implicit Matrix Factorization with Hadoop

Hadoop at Spotify 2009

Hadoop at Spotify 2014

700 nodes in our London data center

Implicit Matrix Factorization with Hadoop

[Figure: the map and reduce steps. All log entries are keyed by user and item; user vectors are split into K buckets by u % K and item vectors into L buckets by i % L, so each (u % K, i % L) pair defines one block of work.]

Figure via Erik Bernhardsson

Implicit Matrix Factorization with Hadoop

One map task

[Figure: the distributed cache holds all user vectors where u % K = x and all item vectors where i % L = y. The map input is the set of tuples (u, i, count) where u % K = x and i % L = y; the mapper emits contributions and the reducer combines them into a new vector.]

Figure via Erik Bernhardsson

Hadoop suffers from I/O overhead

I/O Bottleneck

First Attempt

• For each iteration:
  – Compute YᵀY over item vectors and broadcast
  – Join user vectors with all ratings for that user and all item vectors for which the user rated the item
  – Sum up Yᵀ(C_u − I)Y and YᵀC_u p_u and solve for optimal user vectors

[Figure: ratings, user vectors, and item vectors distributed across nodes 1-6]

First Attempt


• Issues:
  – Unnecessarily sending multiple copies of item vectors to each node
  – Unnecessarily shuffling data across the cluster at each iteration
  – Not taking advantage of Spark’s in-memory capabilities!

Second Attempt

• For each iteration:
  – Compute YᵀY over item vectors and broadcast
  – Group the ratings matrix into blocks, and join blocks with the necessary user and item vectors (to avoid sending multiple item vector copies to each node)
  – Sum up Yᵀ(C_u − I)Y and YᵀC_u p_u and solve for optimal user vectors

[Figure: blocked ratings joined with user and item vectors across nodes 1-6]

Second Attempt


• Issues:
  – Still unnecessarily shuffling data across the cluster at each iteration
  – Still not taking advantage of Spark’s in-memory capabilities!

So, what are we missing?

• Partitioner: defines how the elements in a key-value pair RDD are partitioned across the cluster.

[Figure: user vectors split into partitions 1-3 and assigned to nodes 1-6]

So, what are we missing?

• partitionBy(partitioner): partitions all elements with the same key to the same node in the cluster, as defined by the partitioner.

[Figure: user vector partitions co-located on their assigned nodes]
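What a modulo-based partitioner buys us can be shown with a pure-Python sketch (this is not the Spark API; partition_by and the toy user_vectors and ratings data are illustrative names): records sharing a key always land in the same bucket, so a join on user ID needs no cross-node shuffle once both sides use the same partitioner.

```python
K = 3  # number of partitions (illustrative)

def partition_by(records, num_partitions):
    """Group (key, value) pairs into buckets by key % num_partitions,
    mimicking how a hash partitioner assigns keys to partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[key % num_partitions].append((key, value))
    return buckets

user_vectors = [(0, "x0"), (1, "x1"), (2, "x2"), (3, "x3"), (4, "x4")]
ratings      = [(0, ("track_a", 3)), (3, ("track_b", 1)), (4, ("track_c", 7))]

# Both datasets use the same partitioner, so matching keys co-locate.
vec_parts = partition_by(user_vectors, K)
rat_parts = partition_by(ratings, K)

for p in range(K):
    print(p, {k for k, _ in vec_parts[p]}, {k for k, _ in rat_parts[p]})
```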

So, what are we missing?

• mapPartitions(func): similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T.

[Figure: function() applied once to each user vector partition on its node]
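The Iterator[T] => Iterator[U] contract can be illustrated with a pure-Python stand-in (again not Spark itself; map_partitions, scale, and the toy partitions are illustrative): the function sees a whole partition at once, so any per-block setup runs once per partition instead of once per element.

```python
from typing import Iterator, List, Tuple

def map_partitions(partitions: List[list], func) -> List[list]:
    """Apply func to each partition's iterator, mimicking mapPartitions."""
    return [list(func(iter(part))) for part in partitions]

def scale(part: Iterator[Tuple[int, float]]) -> Iterator[Tuple[int, float]]:
    # Per-partition setup (e.g. building YᵀY) would go here, once per block.
    for user_id, value in part:
        yield (user_id, value * 2.0)

partitions = [[(0, 1.0), (3, 2.0)], [(1, 0.5)], [(2, 4.0)]]
print(map_partitions(partitions, scale))
```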

So, what are we missing?

• persist(storageLevel): sets this RDD’s storage level to persist (cache) its values across operations after the first time it is computed.

[Figure: user vector partitions cached in memory on their nodes]

Third Attempt

• Partition the ratings matrix, user vectors, and item vectors by user and item blocks, and cache the partitions in memory
• Build InLink and OutLink mappings for users and items
  – InLink mapping: the user IDs and vectors for a given block, along with the ratings for each user in that block
  – OutLink mapping: the item IDs and vectors for a given block, along with the list of destination blocks to which to send these vectors
• For each iteration:
  – Compute YᵀY over item vectors and broadcast
  – On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
  – On each user block, use the InLink mapping along with the joined item vectors to update the user vectors

[Figure: ratings, user vectors, and item vectors partitioned into co-located blocks across nodes 1-6]
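The OutLink idea above can be sketched in pure Python (this is a simplified illustration, not MLlib's actual data structures; the block counts K and L and the toy ratings triples are made up): from the ratings we derive, for each item, the set of user blocks that need its vector, so each vector is sent to a destination block at most once per iteration rather than once per rating.

```python
from collections import defaultdict

K, L = 2, 2   # number of user blocks and item blocks (illustrative)

# (user, item, count) triples standing in for the ratings matrix.
ratings = [(0, 0, 3), (0, 2, 1), (1, 1, 5), (2, 2, 2), (3, 0, 4), (3, 3, 1)]

# OutLink mapping: (item_block, item) -> set of destination user blocks.
out_links = defaultdict(set)
for u, i, _ in ratings:
    out_links[(i % L, i)].add(u % K)

for (item_block, item), dests in sorted(out_links.items()):
    print(f"item {item} (block {item_block}) -> user blocks {sorted(dests)}")
```

The InLink side is symmetric: each user block keeps its users' ratings grouped locally, so once the item vectors arrive, the per-user solve runs with no further shuffle.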

Third Attempt

ALS Running Times

Via Xiangrui Meng (Databricks): http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf

Fin