This page reproduces the content of http://www.slideshare.net/MrChrisJohnson/music-recommendations-at-scale-with-spark.

Uploaded 2014/06/30, in Technology.

Spotify uses a range of machine learning models to power its music recommendation features, including the Discover page, Radio, and Related Artists. Because these models are iterative, they are a natural fit for Spark's computation paradigm, whereas on Hadoop they suffer from I/O overhead. In this talk, I review the ALS algorithm for matrix factorization with implicit feedback data and how we've scaled it up to handle hundreds of billions of data points using Scala, Breeze, and Spark.

Who am I?

• Chris Johnson
– Machine learning guy from NYC
– Focused on music recommendations
– Formerly a PhD student at UT Austin

Recommendations at Spotify

• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing

How can we find good recommendations?

• Manual Curation
• Manually Tag Attributes
• Audio Content, Metadata, Text Analysis
• Collaborative Filtering

Collaborative Filtering: "The Netflix Prize"
Collaborative Filtering

User A: "Hey, I like tracks P, Q, R, S!"
User B: "Well, I like tracks Q, R, S, T!"
User A: "Then you should check out track P!"
User B: "Nice! Btw try track T!"

Image via Erik Bernhardsson

Explicit Matrix Factorization

• Users explicitly rate a subset of the movie catalog
• Goal: predict how users will rate new movies (e.g. Chris's unknown rating of Inception)
• Approximate the ratings matrix by the product of low-dimensional user and movie matrices X and Y
• Minimize RMSE (root mean squared error)

[Figure: a partially observed Users x Movies ratings matrix, with known entries (e.g. 3, 5) and unknown entries (?), approximated as X * Y]

\min_{x, y, b} \sum_{(u,i) \in \text{observed}} \left( r_{ui} - x_u^\top y_i - b_u - b_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)

• r_{ui} = user u's rating for movie i
• b_u = bias for user u
• b_i = bias for item i
• x_u = user latent factor vector
• y_i = item latent factor vector
• \lambda = regularization parameter

Implicit Matrix Factorization

• Instead of explicit ratings, use binary labels
– 1 = streamed, 0 = never streamed
• Minimize a weighted RMSE (root mean squared error), using a function of total streams as the weights

[Figure: a binary Users x Songs matrix approximated as X * Y]

\min_{x, y, b} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i - b_u - b_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)

• p_{ui} = 1 if user u streamed track i, else 0
• c_{ui} = confidence weight, an increasing function of user u's total streams of track i
• b_u = bias for user u
• b_i = bias for item i
• x_u = user latent factor vector
• y_i = item latent factor vector
• \lambda = regularization parameter

Alternating Least Squares (ALS)

• Instead of explicit ratings, use binary labels
– 1 = streamed, 0 = never streamed
• Minimize the weighted RMSE, using a function of total streams as the weights
• Alternate between the two sides of the factorization:
1. Fix the song vectors; solve for the optimal user vectors
2. Fix the user vectors; solve for the optimal song vectors
3. Repeat until convergence…

[Figure: the binary Users x Songs matrix approximated as X * Y, with one factor matrix held fixed while solving for the other]
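Concretely, with the confidence weights c_ui and preferences p_ui defined above, each ALS half-step has a closed-form solution: x_u = (Yᵀ C_u Y + λI)⁻¹ Yᵀ C_u p(u), and symmetrically for y_i with users fixed. A miniature, dense NumPy sketch of this loop (illustrative only: the function name, toy data, and hyperparameter values are mine, and the bias terms are omitted for brevity):

```python
import numpy as np

def implicit_als(counts, n_factors=2, n_iters=10, alpha=40.0, lam=0.1, seed=0):
    """Minimal dense ALS for implicit feedback.

    counts: (n_users, n_items) array of play counts.
    Preference p_ui = 1 if count > 0 else 0; confidence c_ui = 1 + alpha * count.
    Alternates: fix item vectors Y and solve each user vector, then fix X
    and solve each item vector.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = counts.shape
    X = rng.normal(scale=0.1, size=(n_users, n_factors))
    Y = rng.normal(scale=0.1, size=(n_items, n_factors))
    P = (counts > 0).astype(float)
    C = 1.0 + alpha * counts
    I = np.eye(n_factors)
    for _ in range(n_iters):
        for u in range(n_users):            # fix songs, solve for users
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + lam * I, Y.T @ Cu @ P[u])
        for i in range(n_items):            # fix users, solve for songs
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + lam * I, X.T @ Ci @ P[:, i])
    return X, Y

counts = np.array([[5, 3, 0, 0],
                   [2, 0, 4, 0],
                   [0, 0, 3, 6]], dtype=float)
X, Y = implicit_als(counts)
scores = X @ Y.T   # predicted preference for every (user, item) pair
```

The heavy confidence on streamed entries pulls their predicted preferences toward 1, while never-streamed entries are only weakly pulled toward 0.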

Scaling up Implicit Matrix Factorization with Hadoop

Hadoop at Spotify, 2009

Hadoop at Spotify, 2014: 700 nodes in our London data center

Implicit Matrix Factorization with Hadoop

Map step / Reduce step

Partition the user vectors into K blocks by u % K and the item vectors into L blocks by i % L; every log entry (u, i, count) is routed to the (u % K, i % L) block it belongs to.

[Figure: grid of user-vector blocks (u % K = 0, 1, …, K-1) crossed with item-vector blocks (i % L = 0, 1, …, L-1), fed by all log entries. Figure via Erik Bernhardsson]

Implicit Matrix Factorization with Hadoop

One map task:

• Map input: tuples (u, i, count) where u % K = x and i % L = y
• Distributed cache: all user vectors where u % K = x
• Distributed cache: all item vectors where i % L = y
• Mapper: emit contributions; Reducer: aggregate contributions into a new vector

[Figure of the map-task layout via Erik Bernhardsson]
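The u % K / i % L routing above can be sketched in a few lines of plain Python (the function names and toy ratings are mine, for illustration):

```python
from collections import defaultdict

def block_of(u, i, K, L):
    """Map a (user, item) rating to its (u % K, i % L) grid block."""
    return (u % K, i % L)

def partition_ratings(ratings, K, L):
    """Group (u, i, count) tuples into K*L blocks, mirroring the Hadoop layout:
    one map task later receives block (x, y) together with the cached user
    vectors where u % K == x and item vectors where i % L == y."""
    blocks = defaultdict(list)
    for u, i, count in ratings:
        blocks[block_of(u, i, K, L)].append((u, i, count))
    return blocks

ratings = [(0, 0, 3.0), (0, 5, 1.0), (7, 5, 2.0), (3, 2, 4.0)]
blocks = partition_ratings(ratings, K=2, L=3)
```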

Hadoop suffers from I/O overhead

IO Bottleneck

First Attempt (broadcast everything)

• For each iteration:
1. Compute YtY over the item vectors and broadcast it
2. Broadcast the item vectors
3. Group the ratings by user
4. Solve for the optimal user vector

[Figure: ratings, user vectors, and item vectors spread across workers 1-6, with the broadcast YtY copied to every worker]
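Steps 1 and 4 above rely on precomputing YtY once per iteration: since Yᵀ C_u Y = YtY + Yᵀ (C_u − I) Y, and C_u − I is zero for items the user never streamed, each per-user solve only touches that user's streamed items. A NumPy sketch of the per-user solve (the helper name and toy data are mine; biases omitted):

```python
import numpy as np

def solve_user(user_counts, Y, YtY, alpha=40.0, lam=0.1):
    """Solve one user's vector given the broadcast item matrix Y and
    precomputed YtY, touching only the user's streamed items."""
    f = Y.shape[1]
    A = YtY + lam * np.eye(f)
    b = np.zeros(f)
    for i, cnt in user_counts.items():      # sparse loop over streamed items
        c = 1.0 + alpha * cnt               # confidence c_ui
        A += (c - 1.0) * np.outer(Y[i], Y[i])   # Yt (Cu - I) Y contribution
        b += c * Y[i]                            # Yt Cu p_u contribution
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 10))
YtY = Y.T @ Y                               # computed once, then broadcast
x = solve_user({3: 5.0, 42: 2.0}, Y, YtY)
```

The sparse update gives exactly the same answer as forming the full 100x100 confidence matrix, at a fraction of the cost.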

First Attempt (broadcast everything)

• Cons:
– Unnecessarily shuffles all data across the wire each iteration
– Does not cache the ratings data
– Unnecessarily sends a full copy of the user/item vectors to all workers

Second Attempt (full gridify)

• Group the ratings matrix into K x L blocks, partition, and cache
• For each iteration:
1. Compute YtY over the item vectors and broadcast it
2. For each item vector, send a copy to each ratings block in its i % L column
3. Compute intermediate terms for each block (partition)
4. Group by user, aggregate the intermediate terms, and solve for the optimal user vector

[Figure: ratings blocks, user vectors, and item vectors spread across workers 1-6, with the broadcast YtY on every worker]
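Steps 3-4 above can be simulated in plain Python: each block emits per-user partial normal-equation terms from its own item vectors, and a shuffle then sums the partials by user before solving. The function names and toy data below are illustrative only:

```python
import numpy as np
from collections import defaultdict

def block_partials(block_ratings, Y, alpha=40.0):
    """Step 3: a ratings block emits its local contribution to each user's
    normal equations, using only the item vectors in its own i % L column."""
    f = Y.shape[1]
    out = defaultdict(lambda: [np.zeros((f, f)), np.zeros(f)])
    for u, i, cnt in block_ratings:
        c = 1.0 + alpha * cnt
        out[u][0] += (c - 1.0) * np.outer(Y[i], Y[i])   # partial of Yt(Cu-I)Y
        out[u][1] += c * Y[i]                            # partial of Yt Cu p_u
    return out

def aggregate_and_solve(all_partials, YtY, lam=0.1):
    """Step 4: shuffle the partials by user, sum them, and solve per user.
    Shuffling these f x f partial matrices is the extra IO that the
    half-gridify scheme avoids."""
    f = YtY.shape[0]
    merged = defaultdict(lambda: [np.zeros((f, f)), np.zeros(f)])
    for partials in all_partials:
        for u, (A, b) in partials.items():
            merged[u][0] += A
            merged[u][1] += b
    return {u: np.linalg.solve(YtY + A + lam * np.eye(f), b)
            for u, (A, b) in merged.items()}

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))
YtY = Y.T @ Y
# user 0's ratings split across two item-column blocks
p1 = block_partials([(0, 0, 3.0), (0, 2, 1.0)], Y)
p2 = block_partials([(0, 4, 2.0)], Y)
users = aggregate_and_solve([p1, p2], YtY)
```

Because the partials are additive, splitting a user's ratings across blocks changes nothing about the solution, only about how much data crosses the wire.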

Second Attempt (full gridify)

• Pros
– Ratings get cached and never shuffled
– Each partition only requires a subset of the item (or user) vectors in memory each iteration
– Potentially requires less local memory than a "half gridify" scheme
• Cons
– Sends lots of intermediate data over the wire each iteration in order to aggregate and solve for the optimal vectors
– More IO overhead than a "half gridify" scheme

Third Attempt (half gridify)

• Partition the ratings matrix into K user (row) and item (column) blocks, partition, and cache
• For each iteration:
1. Compute YtY over the item vectors and broadcast it
2. For each item vector, send a copy to each user ratings partition that requires it (potentially all partitions)
3. Each partition aggregates its intermediate terms and solves for its optimal user vectors

[Figure: ratings, user vectors, and item vectors across workers 1-6, with the broadcast YtY. Note that we removed the extra shuffle from the full gridify approach.]

Actual MLlib code! (this half-gridify scheme is what MLlib's ALS implements)

• Pros
– Ratings get cached and never shuffled
– Once the item vectors are joined with the ratings partitions, each partition has enough information to solve for its optimal user vectors without any additional shuffling/aggregation (which the "full gridify" scheme requires)
• Cons
– Each partition could potentially require a copy of every item vector (which may not all fit in memory)
– Potentially requires more local memory than the "full gridify" scheme

ALS Running Times

• Dataset consisting of Spotify streaming data for 4 million users and 500k artists
– Note: the full dataset consists of 40M users and 20M songs, but we haven't yet successfully run it with Spark
• All jobs run with 40 latent factors
• Spark jobs used 200 executors with 20G containers
• The Hadoop job used 1k mappers and 300 reducers

Hadoop:               10 hours
Spark (full gridify): 3.5 hours
Spark (half gridify): 1.5 hours

Random Learnings

• PairRDDFunctions are your friend!
• Kryo serialization is faster than Java serialization, but may require you to write and/or register your own serializers
• Running with larger datasets often results in failed executors, and the job never fully recovers
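For reference, the Kryo tip above corresponds to Spark's standard configuration keys; a minimal sketch (the registrator class name is a hypothetical placeholder for your own):

```properties
# spark-defaults.conf
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired false
# Register custom classes/serializers via a KryoRegistrator, e.g.:
# spark.kryo.registrator        com.example.MyKryoRegistrator
```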

Fin