This page reproduces the content of http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations.


Uploaded on 2016/03/15 in Technology

DC Spark Users Group March 15 2016 - Spark and Netflix Recommendations


- advancedspark.com

IBM Spark

Power of data. Simplicity of design. Speed of innovation.

spark.tc

Who Am I?

Streaming Data Engineer

Netflix OSS Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer

IBM Technology Center

Meetup Organizer

Due 2016

Advanced Apache Meetup

Book Author

Advanced .

Recent World Tour: Freg-a-Palooza!

London Spark Meetup (Oct 12th)

Oslo Big Data Hadoop Meetup (Nov 19th)

Scotland Data Science Meetup (Oct 13th)

Helsinki Spark Meetup (Nov 20th)

Dublin Spark Meetup (Oct 15th)

Stockholm Spark Meetup (Nov 23rd)

Barcelona Spark Meetup (Oct 20th)

Copenhagen Spark Meetup (Nov 25th)

Madrid Big Data Meetup (Oct 22nd)

Istanbul Spark Meetup (Nov 26th)

Paris Spark Meetup (Oct 26th)

Budapest Spark Meetup (Nov 28th)

Amsterdam Spark Summit (Oct 27th)

Singapore Spark Meetup (Dec 1st)

Brussels Spark Meetup (Oct 30th)

Sydney Spark Meetup (Dec 8th)

Zurich Big Data Meetup (Nov 2nd)

Melbourne Spark Meetup (Dec 9th)

Geneva Spark Meetup (Nov 5th)

Toronto Spark Meetup (Dec 14th)

Advanced Apache Spark Meetup

http://advancedspark.com

Meetup Metrics

Top 5 Most-active Spark Meetup!

2600+ Members in just 6 mos!!

2600+ Docker downloads (demos)

Meetup Mission

Code deep-dive into Spark and related open source projects

Surface key patterns and idioms

Focus on distributed systems, scale, and performance

Live, Interactive Demo!

Audience Participation Required!!

Cell Phone Compatible!!!

http://demo.advancedspark.com

http://demo.advancedspark.com

[Architecture diagram. Labels: End User, Kafka, Spark Streaming, ElasticSearch, Cassandra, Redis, Spark ML, Data Scientist, Zeppelin, iPython]

Presentation Outline

① Scaling

② Similarities

③ Recommendations

④ Approximations

⑤ Netflix Recommendations

Scaling with Parallelism

O(log n)

Peter

O(log n)

Worker

Nodes

Parallelism with Composability

Worker 1

Worker 2

Max: (a max b max c max d) == (a max b) max (c max d)

Set Union: (a U b U c U d) == (a U b) U (c U d)

Addition: (a + b + c + d) == (a + b) + (c + d)

Multiply: (a * b * c * d) == (a * b) * (c * d)

Collect at Driver

What about Division and Average?
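The composability idea can be sketched in plain Python, with no Spark involved. The helper name `reduce_in_partitions` and the two-way split are illustrative assumptions, standing in for workers that each reduce their own partition before the driver combines the partial results. The input list is assumed to be non-empty.

```python
from functools import reduce
import operator

def reduce_in_partitions(op, values, num_partitions=2):
    """Reduce each partition on its own, then reduce the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(op, part) for part in partitions]  # one result per "worker"
    return reduce(op, partials)                            # combined at the "driver"

values = [3, 4, 7, 8]

# Associative operations give the same answer either way.
for op in (max, operator.add, operator.mul):
    assert reduce_in_partitions(op, values) == reduce(op, values)
assert reduce_in_partitions(operator.or_, [{1}, {2}, {2, 3}, {4}]) == {1, 2, 3, 4}

# Division is not associative, so splitting the work changes the answer.
sequential = reduce(operator.truediv, values)                 # ((3 / 4) / 7) / 8
partitioned = reduce_in_partitions(operator.truediv, values)  # (3 / 4) / (7 / 8)
print(round(sequential, 4), round(partitioned, 4))            # 0.0134 0.8571
```

Max, union, addition and multiplication survive the regrouping; division does not, which is what the next slides explore.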

What about Division?

Division: (a / b / c / d) != (a / b) / (c / d)

(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)

(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))

0.0134 != 0.857

Not Composable

What were the Egyptians thinking?!

“Divide like an Egyptian”

What about Average?

AVG(3, 5, 5, 7) == 5

Pairwise AVG: Divide, Add, Divide?

(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5

Not Composable

Overall AVG: Add, Add, Add?

(values, counts): (3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5

Composable!

Single-Node Divide at the End? Doesn’t need to be Composable!
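A small Python sketch of the same trick: carry (sum, count) pairs through the composable part and divide once at the end. The function name `combine` and the two-way split are illustrative assumptions, not code from the talk.

```python
from functools import reduce

def combine(a, b):
    """Merge two (sum, count) pairs; element-wise addition is associative."""
    return (a[0] + b[0], a[1] + b[1])

values = [3, 5, 5, 7]
pairs = [(value, 1) for value in values]

# Two "workers" each combine their half, then the "driver" combines the partials.
left = reduce(combine, pairs[:2])     # (8, 2)
right = reduce(combine, pairs[2:])    # (12, 2)
total, count = combine(left, right)   # (20, 4)

average = total / count               # single divide at the end: 5.0

# The non-composable version from the slide: divide, add, divide.
naive = left[0] / left[1] + right[0] / right[1]   # 4.0 + 6.0 = 10.0
```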

Presentation Outline

① Scaling

② Similarities

③ Recommendations

④ Approximations

⑤ Netflix Recommendations

Similarities

Euclidean Similarity

Exists in Euclidean, flat space

Based on Euclidean distance

Linear measure

Bias towards magnitude

Cosine Similarity

Angular measure

Adjusts for Euclidean magnitude bias

Normalize to unit vectors in all dimensions

org.jblas.DoubleMatrix
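To make the magnitude bias concrete, here is a minimal Python sketch that compares Euclidean distance with cosine similarity on two made-up viewing vectors. The vectors and function names are assumptions for illustration; the talk's demos use org.jblas.DoubleMatrix rather than plain lists.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

light_viewer = [1.0, 2.0, 1.0]
heavy_viewer = [10.0, 20.0, 10.0]   # same taste, ten times the volume

distance = euclidean_distance(light_viewer, heavy_viewer)   # large: magnitude dominates
similarity = cosine_similarity(light_viewer, heavy_viewer)  # about 1.0: same direction
```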

Jaccard Similarity

Set similarity measurement

Set intersection / set union

Based on Jaccard distance

Bias towards popularity
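Jaccard similarity is short enough to write out directly. In this sketch the tag sets are invented examples, and returning 0.0 for two empty sets is a convention chosen here rather than something the slides specify.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0   # convention for two empty sets (an assumption)
    return len(a & b) / len(union)

tags_a = {"action", "military", "pilots"}
tags_b = {"military", "courtroom", "drama"}

score = jaccard_similarity(tags_a, tags_b)   # 1 shared tag / 5 distinct tags = 0.2
distance = 1.0 - score                       # Jaccard distance
```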

Log Likelihood Similarity

Adjusts for popularity bias

Netflix “Shawshank” problem

Word Similarity

Edit Distance

Misspellings and autocorrect

Word2Vec

Similar words are defined by similar contexts in vector space

English

Spanish
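The slide names edit distance without picking a variant. The sketch below assumes the common Levenshtein form, which counts insertions, deletions and substitutions, and uses a rolling-row dynamic program.

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings."""
    previous = list(range(len(target) + 1))   # distance from "" to each target prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]                         # distance from source[:i] to ""
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

edit_distance("kitten", "sitting")   # 3
edit_distance("spark", "sprak")      # 2: a swapped pair costs two substitutions
```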

Demo!

Find Synonyms with Word2Vec

Find Synonyms using Word2Vec

Document Similarity

TF/IDF

Term Freq / Inverse Document Freq

Used by most search engines

Doc2Vec

Similar documents are determined by similar contexts
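As a rough illustration of TF/IDF, the sketch below uses one common unsmoothed variant: term frequency within the document times log(N / document frequency). This formula is an assumption; libraries differ on smoothing (Spark's IDF, for example, adds 1 to numerator and denominator). The three toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the pilot flies the jet".split(),
    "the lawyer questions the colonel".split(),
    "the cadets defend the academy".split(),
]
weights = tf_idf(docs)
# "the" appears in every document, so its weight is 0; "pilot" is distinctive.
```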

Bonus! Text Rank Document Summary

Text Rank (aka Sentence Rank)

Surface summary sentences

TF/IDF + Similarity Graph + PageRank

Most similar sentence to all other sentences

TF/IDF + Similarity Graph

Most influential sentences

PageRank

Similarity Pathways (Recommendations)

Best recommendations for 2 (or more) people

“You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.”

Item-to-Item Similarity Graph + Dijkstra Shortest Path
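One way to read "similarity graph + Dijkstra" is to turn each similarity into a distance (here 1 - similarity, an assumption) and take the shortest path between the two people's picks; movies along the path are candidates both might like. The node names A to D and the scores are hypothetical placeholders.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over {node: {neighbor: distance}}; returns (distance, path)."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        distance, node, path = heapq.heappop(queue)
        if node == goal:
            return distance, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (distance + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical Jaccard similarities between movies A, B, C and D.
similarity = {("A", "B"): 0.6, ("B", "D"): 0.7, ("A", "C"): 0.2, ("C", "D"): 0.9}

graph = {}
for (left, right), score in similarity.items():
    graph.setdefault(left, {})[right] = 1.0 - score   # similar movies are "close"
    graph.setdefault(right, {})[left] = 1.0 - score

distance, path = shortest_path(graph, "A", "D")   # path runs through B
```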

Demo!

Similarity Pathway for Movie Recommendations

Load Movies with Tags into DataFrame

My Choice

Their Choice

Calculate Tag-based Movie Similarity

Based on Tags

Jaccard Similarity (Based on Tag Sets)

Above Jaccard Similarity Threshold

Create Movie-Tag Similarity Graph

Edge Value Represents Jaccard Similarity (Based on Tag Sets)

Calculate Dijkstra Shortest Pathway

Movies with Tags

My Choice

Their Choice

Our Choice

Calculating Similarity

Exact Brute-Force Similarity

Cartesian Product

O(n^2) shuffle and comparison

aka All-pairs, Pair-wise, Similarity Join

Approximate Similarity

Sampling

Bucketing or Clustering

Ignore joins of low-similarity probability

Goal: Reduce shuffle

Similarity Graph

Vertex is movie, tag, actor, plot summary, etc.

Edges are relationships and weights (if provided)

Presentation Outline

① Scaling

② Similarities

③ Recommendations

④ Approximations

⑤ Netflix Recommendations

Recommendations

Basic Terminology

User: User seeking recommendations

Item: Item being recommended

Explicit User Feedback: like, rating, movie view, profile read, search

Implicit User Feedback: click, hover, scroll, navigation

Instances: Rows of user feedback/input data

Overfitting: Training a model too closely to the training data & hyperparameters

Hold Out Split: Holding out some of the instances to avoid overfitting

Features: Columns of instance rows (of feedback/input data)

Cold Start Problem: Not enough data to personalize (new)

Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)

Model Evaluation: Compare predictions to actual values of hold out split

Feature Engineering: Modify, reduce, combine features

Features

Binary: True or False

Numeric Discrete: Integers

Numeric: Real Values

Binning: Convert Continuous into Discrete (Time of Day->Morning, Afternoon)

Categorical Ordinal: Size (Small->Medium->Large), Ratings (1->5)

Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots

Temporal: Time-based, Time of Day, Binge Viewing

Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)

Media: Images, Audio, Video

Geographic: (Longitude, Latitude), Geohash

Latent: Hidden Features within Data (Collaborative Filtering)

Derived: Age of Movie, Duration of User Subscription

Feature Engineering

Dimension Reduction

Reduce number of features in feature space

Principal Component Analysis (PCA)

Find principal features that best describe data variance

Peel dimensional layers back

One-Hot Encoding

Convert nominal categorical feature values into 0’s and 1’s

Remove any numerical relationship between categories

Convert Each Item to Binary Vector with Single 1.0 Column

Bears -> 1 --> Bears -> [1.0, 0.0, 0.0]

49’ers -> 2 --> 49’ers -> [0.0, 1.0, 0.0]

Steelers -> 3 --> Steelers -> [0.0, 0.0, 1.0]
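A minimal Python sketch of one-hot encoding that reproduces the team example. Assigning columns in first-seen order is an assumption made so the output matches the slide; real encoders may order categories differently.

```python
def one_hot_encode(values):
    """Map each category to a vector with a single 1.0 in its own column."""
    categories = list(dict.fromkeys(values))   # unique values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0.0] * len(categories)
        vector[index[value]] = 1.0
        vectors.append(vector)
    return categories, vectors

categories, vectors = one_hot_encode(["Bears", "49ers", "Steelers", "Bears"])
# Bears -> [1.0, 0.0, 0.0], 49ers -> [0.0, 1.0, 0.0], Steelers -> [0.0, 0.0, 1.0]
```

Unlike the integer codes 1, 2, 3, these vectors are all equally far apart, so no ordering between teams is implied.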

Feature Normalization & Standardization

Goal

Scale features to standard size

Required by many ML algos

Normalize Features

http://www.mathsisfun.com/data/standard-normal-distribution.html

Calculate L1 (or L2, etc.) norm, then divide each element by it

org.apache.spark.ml.feature.Normalizer

Standardize Features

Apply standard normal transformation

mean == 0, stddev == 1

org.apache.spark.ml.feature.StandardScaler
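The two operations are easy to confuse, so here is a plain-Python sketch of each. It assumes the L2 norm for normalization and the sample standard deviation (n - 1 denominator) for standardization; the function names are illustrative, not the Spark API. Normalization works on one row (vector) at a time, while standardization works on one feature column across rows.

```python
import math

def l2_normalize(vector):
    """Scale one vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    """Shift and scale one feature column to mean 0 and stddev 1.

    Assumes at least two values that are not all identical.
    """
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
    stddev = math.sqrt(variance)
    return [(x - mean) / stddev for x in column]

unit = l2_normalize([3.0, 4.0])       # [0.6, 0.8]
scaled = standardize([2.0, 4.0, 6.0]) # [-1.0, 0.0, 1.0]
```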

Non-Personalized Recommendations

Cold Start Problem

“Cold Start” problem

New user, don’t know their preference, must show something!

Movies with highest-rated actors

Top K aggregations

Facebook social graph

Friend-based recommendations

Most desirable singles

PageRank of likes and dislikes
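For intuition about PageRank over a "like" graph, the following is a small power-iteration sketch in plain Python. It is not the GraphFrames implementation used in the demo; the damping factor 0.85, the fixed iteration count, and the four users are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # Nodes with no outbound edges spread their rank over everyone.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical likes: three users like carol, and carol likes alice.
likes = [("alice", "carol"), ("bob", "carol"), ("dave", "carol"), ("carol", "alice")]
ranks = pagerank(likes)
top = max(ranks, key=ranks.get)   # the most "liked" profile
```

A ranking like this needs no knowledge of the new user, which is why it suits the cold-start case.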

Demo!

GraphFrame PageRank

Dating Site Example: Like Graph

PageRank of Top Influencers

Personalized Recommendations

User-to-User Clustering

User Similarity

Time-based

Pattern of viewing (binge or casual)

Time of viewing (am or pm)

Ratings-based

Content ratings or number of views

Average rating relative to others (critical or lenient)

Search-based

Search terms

Item-to-Item Clustering

Item Similarity

Profile text (TF/IDF, Word2Vec, NLP)

Categories, tags, interests (Jaccard Similarity, LSH)

Images, facial structures (Neural Nets, Eigenfaces)

Dating Site Example: Items == Users!

My OKCupid Profile

My Hinge Profile

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Bonus: NLP Conversation Starter Bot

“If your responses to my generic opening lines are positive, I may read your profile.”

Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Bonus: Demo!

Spark + Stanford CoreNLP Sentiment Analysis

Bonus: Top 100 Country Song Sentiment

Bonus: Surprising Results…?!

Item-to-Item Based Recommendations

Based on Metadata: Genre, Description, Cast, City

Demo!

Item-to-Item-based Recommendations

One-Hot Encoding + K-Means Clustering

Convert Movie Tags to Feature Vectors

Cluster Using Movie-Tag Feature Vectors

Analyze Movie Tag Clusters

User-to-Item Collaborative Filtering

Matrix Factorization

① Factor the large matrix (left) into 2 smaller matrices (right)

② Lower-rank matrices approximate original when multiplied

③ Fill in the missing values of the large matrix

④ Surface k (rank) latent features from user-item interactions
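The four steps above can be sketched with a tiny matrix factorization in plain Python. Note the swap: this sketch trains with stochastic gradient descent, whereas the demos that follow use Spark's ALS (alternating least squares). The ratings, rank, learning rate and regularization values are all made-up assumptions.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, steps=5000, lr=0.01, reg=0.02, seed=42):
    """Learn user and item factor matrices from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, rating in ratings:
            error = rating - sum(a * b for a, b in zip(users[u], items[i]))
            for k in range(rank):
                u_k, i_k = users[u][k], items[i][k]
                users[u][k] += lr * (error * i_k - reg * u_k)
                items[i][k] += lr * (error * u_k - reg * i_k)
    return users, items

def predict(users, items, u, i):
    """Dot product of a user's and an item's latent factors."""
    return sum(a * b for a, b in zip(users[u], items[i]))

# Sparse 3 x 3 rating matrix: only 6 of the 9 cells are observed.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
users, items = factorize(ratings, n_users=3, n_items=3)

# Step 3: the product also yields a value for a cell that was never rated.
filled_in = predict(users, items, 0, 2)
```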

Item-to-Item Collaborative Filtering

Famous Amazon Paper circa 2003

Problem

As users grew, user-to-item collaborative filtering didn’t scale

Solution

Item-to-item similarity, nearest neighbors

Offline (Batch)

Generate itemId->List[userId] vectors

Online (Real-time)

From cart, recommend nearest-neighbors in vector space

Demo!

Collaborative Filtering-based Recommendations

Fitting the Matrix Factorization Model

Show ItemFactors Matrix from ALS

Show UserFactors Matrix from ALS

Generating Individual Recommendations

Generating Batch Recommendations

Clustering + Collaborative Filtering Recs

Cluster matrix output from Matrix Factorization

Latent features derived from user-to-item interactions

Item-to-Item Similarity

Cluster item-factor matrix->

User-to-User Similarity

<-Cluster user-factor matrix

Demo!

Clustering + Collaborative Filtering-based Recommendations

Show ItemFactors Matrix from ALS

Convert to Item Factors -> mllib.Vector

Required by K-Means Clustering Algorithm

Fit and Evaluate K-Means Cluster Model

K = 5 Clusters

Measures Closeness of Points Within Clusters

Netflix Genres and Clusters

Typical Genres

Documentary, Romance, Comedy, Horror, Action, Adventure

Latent (Hidden) Clusters

Emotionally-Independent Dramas for Hopeless Romantics

Witty Dysfunctional-Family TV Animated Comedies

Romantic Crime Movies based on Classic Literature

Latin American Forbidden-Love Movies

Critically-acclaimed Emotional Drug Movie

Cerebral Military Movie based on Real Life

Sentimental Movies about Horses for Ages 11-12

Gory Canadian Revenge Movies

Raunchy Mad Scientist Comedy

Demo!

Personalized PageRank

Personalized PageRank

Personalized PageRank (No Outbound)

Presentation Outline

① Scaling

② Similarities

③ Recommendations

④ Approximations

⑤ Netflix Recommendations

When to Approximate?

Memory or time constrained queries

Relative vs. exact counts are OK (# errors between then and now)

Using machine learning or graph algos

Inherently probabilistic and approximate

Finding topics in documents (LDA)

Finding similar pairs of users, items, words at scale (LSH)

Finding top influencers (PageRank)

Streaming aggregations

Inherently sloppy collection (exactly once?)

Approximate as much as you can get away with!

Ask for forgiveness later !!

When NOT to Approximate?

If you’ve ever heard the term…

“Sarbanes-Oxley”

…at the office.

A Few Good Algorithms

You can’t handle the approximate!

Common to These Algos & Data Structs

Low, fixed size in memory

Known error bounds

Store large amount of data

Less memory than Java/Scala collections

Tunable tradeoff between size and error

Rely on multiple hash functions or operations

Size of hash range defines error

Bloom Filter

Set.contains(key): Boolean

“Hash Multiple Times and Flip the Bits Wherever You Land”

Bloom Filter

Approximate Set.contains(key)

No means No, Yes means Maybe

Elements can only be added

Never updated or removed

Bloom Filter in Action

set(key)

contains(key): Boolean

Images by @avibryant

Set.contains(key): TRUE -> maybe contains

Set.contains(key): FALSE -> definitely does not contain.
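"Hash multiple times and flip the bits wherever you land" translates almost directly into code. This is a teaching sketch, not a tuned implementation: the class and method names, the bit-array size, and the use of seeded SHA-256 as the family of hash functions are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per hash function, simulated by seeding SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means definitely absent; True only means "maybe".
        return all(self.bits[position] for position in self._positions(key))

viewers = BloomFilter()
viewers.add("user1001")
viewers.might_contain("user1001")   # True: an added key is never reported absent
```

Because bits are only ever set, never cleared, removing a key is impossible without risking false negatives for other keys that share a bit.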

CountMin Sketch

Frequency Count and TopK

“Hash Multiple Times and Add 1 Wherever You Land”

CountMin Sketch (CMS)

Approximate frequency count and TopK for key

i.e., “Heavy Hitters” on Twitter

Matei Zaharia

Martin Odersky

Donald Trump

CountMin Sketch In Action (TopK, Count)

Use Case: TopK movies using total views

add(Top Gun, 2): x 2 occurrences of “Top Gun” for slightly additional complexity

Binary hash output (1 element per column)

Multiple hash functions (1 hash function per row)

add(A Few Good Men, 1): Overlap Top Gun

add(Taps, 1): Overlap A Few Good Men

getCount(Top Gun): Long

Find minimum of all rows

Can overestimate, but never underestimate

Images derived from @avibryant
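The walkthrough above can be mirrored in a few lines of Python. The table dimensions and the seeded SHA-256 hashing are assumptions for illustration; production sketches (Algebird's, for example) derive width and depth from the desired error bounds.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]   # one row per hash function

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, column in self._cells(key):
            self.table[row][column] += count

    def get_count(self, key):
        # Collisions only ever inflate a cell, so the minimum across rows
        # is the tightest estimate and can never be below the true count.
        return min(self.table[row][column] for row, column in self._cells(key))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
estimate = views.get_count("Top Gun")   # at least 2
```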

HyperLogLog

Count Distinct

“Hash Multiple Times and Uniformly Distribute Where You Land”

HyperLogLog (HLL)

Approximate count distinct

Slight twist

Not many of these

Special hash function creates uniform distribution

Error estimate

14 bits for size of range

m = 2^14 = 16,384 hash slots

error = 1.04/(sqrt(16,384)) = .81%
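The error estimate is simple arithmetic, so it is easy to check and to see the size/error tradeoff. The function name below is illustrative; the constant 1.04 and the 14-bit example come from the slide.

```python
import math

def hll_standard_error(precision_bits):
    """Relative standard error of HyperLogLog with 2**precision_bits slots."""
    slots = 2 ** precision_bits
    return 1.04 / math.sqrt(slots)

error_14 = hll_standard_error(14)   # 1.04 / 128 = 0.008125, about 0.81%
error_10 = hll_standard_error(10)   # 1.04 / 32  = 0.0325, about 3.25%
```

Each 2 extra bits of precision quadruple the number of slots and halve the error.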

HyperLogLog In Action (Count Distinct)

Use Case: Number of distinct users who view a movie

Top Gun: Hour 1 (axis 0 to 16): users 7009, 1001, 2009, 3005, 3003, 3001

Uniform Distribution: Estimate distinct # of users by inspecting just the beginning

Top Gun: Hour 2 (axis 0 to 32): users 2001, 4009, 3002, 7002, 1005, 6001, 8001, 8002

Combine across different scales

Top Gun: Hour 1 + 2 (axis 0 to 32): users 7009, 2001, 4009, 1001, 3002, 2009, 7002, 1005, 3005, 6001, 3003, 8001, 8002, 3001

Locality Sensitive Hashing

Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”

Locality Sensitive Hashing (LSH)

Approximate set similarity

Hash designed to cluster similar items

Avoids cartesian all-pairs comparison

Pre-process m rows into b buckets

b << m

Hash items multiple times

Similar items hash to overlapping buckets

Compare just contents of buckets

Much smaller cartesian … and parallel!!
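The slide does not say which LSH family is used. For set similarity a common choice is MinHash with banding, sketched below under that assumption. Signature length, band size, and the seeded SHA-256 hashes are illustrative parameters, and every input set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def min_hash_signature(items, num_hashes=20):
    """For each hash function, keep the smallest hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature

def candidate_pairs(sets_by_id, num_hashes=20, rows_per_band=2):
    """Bucket each signature band; only sets sharing a bucket get compared."""
    buckets = defaultdict(set)
    for set_id, items in sets_by_id.items():
        signature = min_hash_signature(items, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            buckets[(start, band)].add(set_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

movie_tags = {
    "a": {"military", "pilots", "action"},
    "b": {"military", "pilots", "action"},   # identical tags: always a candidate
    "c": {"romance", "letters", "beach"},
}
pairs = candidate_pairs(movie_tags)
```

Only the candidate pairs need an exact Jaccard comparison afterwards, which is what replaces the full cartesian product.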

DIMSUM

Set Similarity

“Pre-process and ignore data that is unlikely to be similar.”

DIMSUM

“Dimension Independent Matrix Square Using MR”

Remove vectors with low probability of similarity

RowMatrix.columnSimilarities(threshold)

Twitter DIMSUM Case Study

40% efficiency gain over brute-force Cosine Sim

Common Tools to Approximate

Twitter Algebird

Composable Library

Redis

Distributed Cache

Apache Spark

Big Data Processing

Twitter Algebird

Rooted in Algebraic Fundamentals!

Parallel

Associative

Composable

Examples

Min, Max, Avg

BloomFilter (Set.contains(key))

HyperLogLog (Count Distinct)

CountMin Sketch (TopK Count)

Redis

Implementation of HyperLogLog (Count Distinct)

12KB per item count

2^64 max # of items

Tunable

0.81% error (Tunable)

Add user views for given movie

PFADD TopGun_HLL user1001 user2009 user3005

PFADD TopGun_HLL user3003 user1001

ignore duplicates

Get distinct count (cardinality) of set

PFCOUNT TopGun_HLL

Returns: 4 (distinct users viewed this movie)

Union 2 HyperLogLog Data Structures

PFMERGE TopGun_HLL Taps_HLL

Spark Approximations

Spark Core

RDD.count*Approx()

PartialResult

Spark SQL

approxCountDistinct(column)

HyperLogLogPlus

Spark ML

Stratified sampling

PairRDD.sampleByKey(withReplacement: Boolean, fractions: Map[K, Double])

DIMSUM sampling

Probabilistic sampling reduces amount of shuffle

RowMatrix.columnSimilarities(threshold)

Demo!

Exact Count vs. Approximate HLL and CMS Count

HashSet vs. HyperLogLog (Memory)

HashSet vs. CountMin Sketch (Memory)

Demo!

Exact Similarity vs. Approximate LSH Similarity

Brute Force Cartesian All Pair Similarity

47 seconds

Locality Sensitive Hash All Pair Similarity

6 seconds

http://advancedspark.com

or

Download Docker

Clone on Github

Many More Demos!

Presentation Outline

① Scaling

② Similarities

③ Recommendations

④ Approximations

⑤ Netflix Recommendations

- Netflix Recommendations

From Ratings to Real-time

- Netflix Has a Lot of Data

Netflix has a lot of data about a lot of users and a lot of movies.

Netflix can use this data to buy new movies.

My favorite movie: “Harold and Kumar Go to White Castle”

Netflix is global.

The UK doesn’t have White Castle.

Renamed my favourite movie to: “Harold and Kumar Get the Munchies”

This broke my unit tests!

Netflix can use this data to choose original programming.

Netflix knows that a lot of people like politics and Kevin Spacey.

Summary: Buy NFLX Stock!

- Netflix Data Pipeline - Then

v1.0

v2.0

- Netflix Data Pipeline – Now (Keystone)

v3.0

Auto-scaling, fault tolerance

SAMZA

Splits high and normal priority

EC2 D2XL

Disk: 6 TB, 475 MB/s

RAM: 30 GB

Network: 700 Mbps

A/B Tests, Trending Now

9 million events per second

22 GB per second!!

- Netflix Recommendation Data Pipeline

Throw away batch-generated user factors (U)

Keep video factors (V)
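
One way to read this slide: if the video factors V are kept, a user's factor vector can be recomputed on demand from that user's latest ratings instead of being stored from the batch run. The sketch below does that with a regularized least-squares solve, which is one half-step of alternating least squares. The function names, the two-dimensional factors, and the ratings are hypothetical; the slides do not describe how Netflix actually recomputes U.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fold_in_user(ratings, video_factors, reg=0.1):
    """Least-squares user vector u minimizing sum((r - u.v)^2) + reg * |u|^2."""
    k = len(next(iter(video_factors.values())))
    a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for video, rating in ratings.items():
        v = video_factors[video]
        for i in range(k):
            b[i] += rating * v[i]
            for j in range(k):
                a[i][j] += v[i] * v[j]
    return solve(a, b)

# Hypothetical kept video factors (V) and one user's recent ratings.
video_factors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
recent_ratings = {"a": 3.0, "b": 1.0}

u = fold_in_user(recent_ratings, video_factors)
score_c = sum(ui * vi for ui, vi in zip(u, video_factors["c"]))
print("user factors:", u, "predicted score for c:", score_c)
```

Because only a small k-by-k system is solved per user, this step is cheap enough to run at request time in principle.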

- Netflix Trending Now (Time-based Recs)

Uses Spark Streaming

Personalized to user (viewing history, past ratings)

Learns and adapts to events (Valentine’s Day)

Number of Impressions

Number of Plays

Calculate Take Rate

“VHS”
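
The diagram labels suggest that take rate is computed from impressions and plays. A minimal sketch, assuming take rate means plays divided by impressions per video; the event tuples and the function name `take_rates` are hypothetical, and a streaming job would compute the same ratio over a time window.

```python
from collections import Counter

def take_rates(events):
    """events: iterable of (video, kind) with kind in {"impression", "play"}."""
    impressions, plays = Counter(), Counter()
    for video, kind in events:
        if kind == "impression":
            impressions[video] += 1
        elif kind == "play":
            plays[video] += 1
    # Videos that were never shown have no defined take rate and are skipped.
    return {video: plays[video] / shown for video, shown in impressions.items() if shown}

events = (
    [("a", "impression")] * 4 + [("a", "play")]
    + [("b", "impression")] * 2
)
rates = take_rates(events)
print(rates)
```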

- Bonus: Pandora Time-based Recs

Work Days

Play familiar music

User is less likely to accept new music

Evenings and Weekends

Play new music

User is more likely to accept new music
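
The rule on this slide can be expressed as a time-dependent share of new music in the mix. This is only an illustrative sketch: the function name, the 18:00 evening cutoff, and the 10% and 50% shares are made-up values, not Pandora's.

```python
from datetime import datetime

def new_music_share(when):
    """Fraction of unfamiliar tracks to mix in at a given local time."""
    is_weekend = when.weekday() >= 5   # Saturday or Sunday
    is_evening = when.hour >= 18       # assumed cutoff for "evening"
    return 0.5 if (is_weekend or is_evening) else 0.1

print(new_music_share(datetime(2016, 3, 15, 10)))  # a Tuesday morning
print(new_music_share(datetime(2016, 3, 19, 10)))  # a Saturday morning
```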

- $1 Million Netflix Prize (2006-2009)

Goal

Improve movie rating predictions by 10% (Root Mean Squared Error, RMSE)

Test data withheld to calculate RMSE upon submission

5-star Ratings Dataset

(userId, movieId, rating, timestamp)

Winning algorithm(s)

10.06% improvement (RMSE)

Ensemble of 500+ ML models combined with GBDTs

Computationally impractical
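
For reference, RMSE and the relative improvement can be computed as below. The two RMSE values in the last line (0.9525 for the baseline and 0.8567 for the winner) are the test-set scores as commonly reported for the contest; they are not on the slide, so treat them as recalled figures used only to show the arithmetic.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def improvement_pct(baseline_rmse, new_rmse):
    """Relative RMSE reduction, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

print(rmse([1.0, 3.0], [2.0, 2.0]))                 # each prediction is off by one star
print(round(improvement_pct(0.9525, 0.8567), 2))    # reported baseline vs. winner
```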

- Secrets to the Winning Algorithms

Adjust for the following human biases…

① Alice effect: user rates lower than avg

② Inception effect: movie rated higher than avg

③ Overall mean rating of a movie

④ Number of people who have rated a movie

⑤ Number of days since user’s first rating

⑥ Number of days since movie’s first rating

⑦ Mood, time of day, day of week, season, weather
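
The first two effects in the list above are often modeled as a baseline predictor: global mean plus a per-movie offset plus a per-user offset. The sketch below is one simple way to fit such a baseline and is not claimed to be the winning teams' exact method; the function names and the four ratings are hypothetical.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, movie, stars). Returns (mu, user_bias, movie_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_movie, by_user = defaultdict(list), defaultdict(list)
    # Movie offset: how far a movie's ratings sit from the global mean.
    for _, movie, r in ratings:
        by_movie[movie].append(r - mu)
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}
    # User offset: what remains after removing the mean and the movie offset.
    for user, movie, r in ratings:
        by_user[user].append(r - mu - movie_bias[movie])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}
    return mu, user_bias, movie_bias

def predict(mu, user_bias, movie_bias, user, movie):
    # Unknown users or movies fall back to a zero offset.
    return mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

ratings = [
    ("alice", "inception", 4), ("alice", "other", 1),
    ("bob", "inception", 5), ("bob", "other", 2),
]
mu, user_bias, movie_bias = fit_baseline(ratings)
print(mu, user_bias, movie_bias)
print(predict(mu, user_bias, movie_bias, "carol", "inception"))
```

Here alice ends up with a negative offset (she rates lower than average) and the first movie with a positive one (it is rated higher than average), matching the two named effects.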

- Netflix Common ML Algorithms

Logistic Regression

Linear Regression

Gradient Boosted Decision Trees

Random Forest

Matrix Factorization

SVD

Restricted Boltzmann Machines

Deep Neural Nets

Markov Models

LDA

Clustering

Ensembles!

- Netflix Genres and Clusters

Typical Genres

Documentaries, Romantic Comedies, Horror, Action, Adventure

Latent (Hidden) Clusters

Emotionally-Independent Dramas for Hopeless Romantics

Witty Dysfunctional-Family TV Animated Comedies

Romantic Crime Movies based on Classic Literature

Latin American Forbidden-Love Movies

Critically-acclaimed Emotional Drug Movie

Cerebral Military Movie based on Real Life

Sentimental Movies about Horses for Ages 11-12

Gory Canadian Revenge Movies

Raunchy Mad Scientist Comedy

- Netflix Social Integration

Post to Facebook after movie start (5 mins)

Recommend to new users based on friends

Helps with Cold Start problem
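
A new user has no viewing history, but their friends do. A minimal sketch of friend-based recommendations, assuming a simple vote count over friends' histories; the function name and the sample data are hypothetical.

```python
from collections import Counter

def recommend_from_friends(user, friends, history, top_n=3):
    """Rank titles by how many of the user's friends watched them."""
    seen = history.get(user, set())
    votes = Counter()
    for friend in friends.get(user, ()):
        for title in history.get(friend, set()) - seen:
            votes[title] += 1
    # Sort by vote count, then alphabetically so ties are deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]

friends = {"newbie": ["ann", "ben"]}
history = {"ann": {"A", "B"}, "ben": {"B", "C"}}
recs = recommend_from_friends("newbie", friends, history, top_n=2)
print(recs)
```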

- Netflix Search

No results? No problem… Show similar results!

Utilize extensive DVD Catalog

Metadata search (ElasticSearch)

Named entity recognition (NLP)

Empty searches are an opportunity!

Explicit feedback for future recommendations

Content to buy and produce!
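
The "show similar results" idea and the "empty searches are an opportunity" idea can be sketched together: fall back to fuzzy matches when nothing matches directly, and record the miss as a demand signal. This uses Python's `difflib` as a stand-in for a real search engine; the catalog, the `missed_queries` counter, and the function name are hypothetical.

```python
import difflib
from collections import Counter

CATALOG = [
    "Harold and Kumar Go to White Castle",
    "Harold and Kumar Get the Munchies",
    "House of Cards",
]
missed_queries = Counter()   # queries with no direct hit, kept as a demand signal

def search(query, catalog=CATALOG):
    q = query.strip().lower()
    by_lower = {title.lower(): title for title in catalog}
    direct = [title for low, title in by_lower.items() if q in low]
    if direct:
        return direct
    missed_queries[q] += 1
    # No direct hit: return the closest titles instead of an empty page.
    close = difflib.get_close_matches(q, list(by_lower), n=3, cutoff=0.6)
    return [by_lower[low] for low in close]

print(search("white castle"))
print(search("hose of cards"))   # misspelled query falls back to similar titles
print(missed_queries)
```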

- Higher Ratings in 2004?

In 2004, Netflix noticed higher ratings on average

Some possible reasons why…

① Significant UI improvements deployed

② New recommendation engine deployed

③

- Thank You, Everyone!!

Chris Fregly @cfregly

IBM Spark Tech Center

San Francisco, California, USA

http://advancedspark.com

Sign up for the Meetup and Book

Contribute to Github Repo

Run all Demos using Docker

Image derived from http://www.duchess-france.org/

Find me: LinkedIn, Twitter, Github, Email, Fax

- IBM Spark

Power of data. Simplicity of design. Speed of innovation.

http://advancedspark.com

@cfregly
