This page reproduces the content of http://www.slideshare.net/cfregly/spark-summit-east-nyc-meetup-02162016.

Uploaded 8 months ago (2016/02/17) in Technology

Spark Summit East NYC Meetup 02-16-2016

- Spark and Recommendations

Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP

Chris Fregly

Principal Data Solutions Engineer

We’re Hiring! (Only Nice People) advancedspark.com!

Spark-NYC Meetup @ Spark Summit

Thanks, Bloomberg!

Feb 16th, 2016

IBM Spark

Power of data. Simplicity of design. Speed of innovation.

spark.tc - Who Am I?

Streaming Data Engineer

Netflix OSS Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer

IBM Technology Center

Meetup Organizer

Due 2016

Advanced Apache Meetup

Book Author

Advanced .

spark.tc - Advanced Apache Spark Meetup

http://advancedspark.com

Meetup Metrics

Top 5 Most-active Spark Meetup!

2600 Members in just 6 mos!!

2600 Docker downloads (demos)

Meetup Mission

Deep-dive into Spark and related open source projects

Surface key patterns and idioms

Focus on distributed systems, scale, and performance

spark.tc - Live, Interactive Demo!!

Audience Participation Required

(cell phone or laptop)

spark.tc - demo.advancedspark.com

[Architecture diagram: End User, Kafka, Spark Streaming, ElasticSearch, Cassandra, Redis, Spark ML, Data Scientist, Zeppelin, iPython]

spark.tc - Presentation Outline

Scaling with Parallelism and Composability

Similarity and Recommendations

When to Approximate

Common Algorithms and Data Structures

Common Libraries and Tools

Netflix Recommendations and Data Pipeline

spark.tc - Scaling with Parallelism

O(log n)

Peter

O(log n)

spark.tc - Scaling with Composability

Max (a max b max c max d) == (a max b) max (c max d)

Set Union (a U b U c U d) == (a U b) U (c U d)

Addition (a + b + c + d) == (a + b) + (c + d)

Multiply (a * b * c * d) == (a * b) * (c * d)

Division??
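
A minimal sketch of why this regrouping matters, assuming a hypothetical helper `reduce_in_two_partitions` that reduces each half of the data independently (as two workers might) and then combines the partial results. Associative operations give the same answer either way; division is included only to show that it does not.

```python
from functools import reduce
import operator


def reduce_in_two_partitions(op, values):
    """Reduce each half independently, then combine the two partial results."""
    mid = len(values) // 2
    left = reduce(op, values[:mid])
    right = reduce(op, values[mid:])
    return op(left, right)


values = [3, 4, 7, 8]
sets = [{1}, {1, 2}, {3}, {2, 4}]

# Associative operations: grouping does not change the result.
assert reduce_in_two_partitions(max, values) == reduce(max, values)
assert reduce_in_two_partitions(operator.add, values) == reduce(operator.add, values)
assert reduce_in_two_partitions(operator.mul, values) == reduce(operator.mul, values)
assert reduce_in_two_partitions(operator.or_, sets) == reduce(operator.or_, sets)

# Division is not associative: the grouped result differs from the sequential one.
sequential = reduce(operator.truediv, values)                 # ((3 / 4) / 7) / 8
grouped = reduce_in_two_partitions(operator.truediv, values)  # (3 / 4) / (7 / 8)
print(sequential, grouped)
```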

spark.tc - What about Division?

Division (a / b / c / d) != (a / b) / (c / d)

(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)

(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))

0.0134 != 0.857

Not Composable

What were the Egyptians thinking?!

“Divide like an Egyptian”

spark.tc - What about Average?

Overall AVG (value / count)

Divide, Add, Divide? Not Composable

AVG (3, 5, 5, 7) == 5

Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5

Add, Add, Add? Composable!

[3, 1], [5, 1], [5, 1], [7, 1] == ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5

Single Divide at the End? Doesn’t need to be Composable!
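
The (value, count) trick can be sketched as follows. The helper names `to_pair`, `combine`, and `average` are hypothetical; the point is that only the additions need to be composable, and the single divide happens once at the end.

```python
from functools import reduce


def to_pair(value):
    """Lift a value into a (sum, count) pair."""
    return (value, 1)


def combine(a, b):
    """Associative merge of two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def average(values):
    total, count = reduce(combine, map(to_pair, values))
    return total / count  # the only divide, done once at the end


values = [3, 5, 5, 7]

# Two "partitions" reduced independently, then merged.
left = reduce(combine, map(to_pair, values[:2]))
right = reduce(combine, map(to_pair, values[2:]))
total, count = combine(left, right)
mean = total / count
print(left, right, mean)
```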

spark.tc - Similarity

spark.tc - Euclidean Similarity

Exists in Euclidean, flat space

Based on Euclidean distance

Linear measure

Bias towards magnitude
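
A small sketch of the magnitude bias. The mapping from distance to a similarity score, `1 / (1 + distance)`, is one common convention and an assumption here, as are the example vectors.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a, b):
    # Assumed convention: squash distance into (0, 1], where 1 means identical.
    return 1.0 / (1.0 + euclidean_distance(a, b))


light_user = [1, 2, 1]     # hypothetical view counts per genre
heavy_user = [10, 20, 10]  # same taste, 10x the activity

# Same proportions, very different magnitudes, so the similarity is low.
print(euclidean_similarity(light_user, heavy_user))
```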

spark.tc - Cosine Similarity

Angular measure

Adjusts for Euclidean magnitude bias

Normalizes to unit vectors
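
A minimal sketch of the angular measure, using hypothetical example vectors. Two users with the same proportions but different activity levels point in the same direction, so their cosine similarity is 1 regardless of magnitude.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


light_user = [1, 2, 1]     # hypothetical view counts per genre
heavy_user = [10, 20, 10]  # same taste, 10x the activity

print(cosine_similarity(light_user, heavy_user))
```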

spark.tc - Jaccard Similarity

Set similarity measurement

Set intersection / set union

Based on Jaccard distance

Bias towards popularity
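
A short sketch of the set ratio, with hypothetical movie tags as the example data. Treating two empty sets as identical is an assumed convention.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


top_gun_tags = {"action", "jets", "navy"}
few_good_men_tags = {"action", "navy", "courtroom"}

# 2 shared tags out of 4 distinct tags overall.
print(jaccard_similarity(top_gun_tags, few_good_men_tags))
```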

spark.tc - Log Likelihood Similarity

Adjusts for popularity bias

Netflix “Shawshank” problem

spark.tc - Word Similarity

Edit Distance

Calculate char differences between words

Deletes, transposes, replaces, inserts
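
One way to count these four operations is the optimal string alignment variant of Damerau-Levenshtein distance. That choice of variant is an assumption, since the slide does not name one. A minimal sketch:

```python
def edit_distance(s, t):
    """Edit distance counting deletes, inserts, replaces, and adjacent transposes."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            # Transpose two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("spark", "sprak"))
```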

spark.tc - Document Similarity

TF/IDF

Term Freq / Inverse Document Freq

Used by most search engines

Word2Vec

Words embedded in a vector space near similar words
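
A minimal TF/IDF sketch in plain Python. It assumes whitespace tokenization and the unsmoothed weighting `tf * log(N / df)`; real libraries use several smoothed variants, and the example documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the pilot flies", "the lawyer argues", "the pilot lands"]
vectors = tf_idf(docs)

# "the" appears everywhere, so it carries no weight; rarer terms weigh more.
print(vectors[0])
```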

spark.tc - Similarity Pathway

i.e., closest recommendations between 2 people

spark.tc - Calculating Similarity

Exact Brute-Force

“All-pairs similarity”

aka “Pair-wise similarity”, “Similarity join”

Cartesian O(n^2) shuffle and comparison

Approximate

Sampling

Bucketing (aka “Partitioning”, “Clustering”)

Remove data with low probability of similarity

Reduce shuffle and comparisons

spark.tc - Bonus: Document Summary

Text Rank

aka “Sentence Rank”

TF/IDF + Similarity Graph + PageRank

Intuition

Surface summary sentences (abstract)

Most similar to all others (TF/IDF + Similarity Graph)

Most influential sentences (PageRank)

spark.tc - Similarity Graph

Vertex is movie, tag, actor, plot summary, etc.

Edges are relationships and weights

spark.tc - Topic-Sensitive PageRank

Graph diffusion algorithm

Pre-process graph, add vector of probabilities to each vertex

Probability of landing at this vertex from every other vertex

spark.tc - Recommendations

spark.tc - Basic Terminology

User: User seeking recommendations

Item: Item being recommended

Explicit User Feedback: like, rating, view movie, read profile, search terms

Implicit User Feedback: click, hover, scroll, navigation

Instances: Rows of user feedback/input data

Overfitting: Training a model too closely to the training data & hyperparameters

Hold Out Split: Holding out some of the instances to avoid overfitting

Features: Columns of instance rows (of feedback/input data)

Cold Start Problem: Not enough data to personalize (new)

Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)

Model Evaluation: Compare predictions to actual values of hold out split

Feature Engineering: Modify, reduce, combine features

spark.tc - Features

Binary Features: True or False

Numeric Discrete Features: Integers

Numeric Features: Real values

Ordinal Features: Maintain order (S -> M -> L -> XL -> XXL)

Temporal Features: Time-based (Time of Day, Binge View)

Categorical Features: Finite, unique categories (sports teams)

Latent Features: Hidden features that arise from within data

spark.tc - Feature Engineering

Dimension Reduction

Reduce number of features (aka “feature space”)

Principal Component Analysis (PCA)

Find principal features that describe the data in terms of variance

Peel the dimensional layers back until you describe the data

Example: One-Hot Encoding

Convert categorical feature values to 0’s, 1’s

Remove any hint of a relationship between the categories

Bears -> 1          Bears -> [1,0,0]

49’ers -> 2   -->   49’ers -> [0,1,0]   (1 binary column per category)

Steelers -> 3       Steelers -> [0,0,1]
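
A minimal sketch of one-hot encoding. The helper `one_hot_encode` is hypothetical, and keeping categories in first-seen order is an assumption made so the output matches the team example.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per category."""
    categories = list(dict.fromkeys(values))  # unique values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


teams = ["Bears", "49'ers", "Steelers", "Bears"]
categories, encoded = one_hot_encode(teams)
print(categories, encoded)
```

Unlike the integer codes 1, 2, 3, the binary columns do not suggest that one team is "greater than" another.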

spark.tc - Non-Personalized Recommendations

spark.tc - Cold Start Problem

“Cold Start” problem

New user, don’t know their pref, must show them something!

Movies with highest-rated actors

Top K Aggregations

Most desirable singles

PageRank of like activity

Facebook social graph

Recommend friend activity

spark.tc - Personalized Recommendations

spark.tc - Clustering (aka. Nearest Neighbors)

User-to-User Clustering

Similar items viewed or rated

Similar viewing pattern (i.e., binge or casual)

Item-to-Item Clustering

Similar item tags/metadata (Jaccard Similarity, Locality Sensitive Hash)

Similar profile text and categories (TF/IDF, Word2Vec, NLP, One-Hot)

Similar images/facial structures (Convolutional Neural Nets, Eigenfaces)

Dating Site ->

My OKCupid Profile

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

My Hinge Profile

spark.tc - Bonus: NLP Conversation Bot

“If your responses to my generic opening lines are positive, I may read your profile.”

Spark ML and Stanford CoreNLP: TF/IDF, DecisionTrees, Sentiment Analysis

spark.tc - User-to-Item Collaborative Filtering

Matrix Factorization

① Factor the large matrix (left) into 2 smaller matrices (right)

② Smaller matrices, when multiplied, approximate original

③ Fill in the missing values within the large matrix

④ Surface latent features from within user-item interaction
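
The four steps can be sketched with a tiny stochastic gradient descent factorization in plain Python. This is an illustrative assumption, not the method the slides or Spark use (Spark ML ships ALS); the ratings, hyperparameters, and function names are all hypothetical.

```python
import math
import random


def predict(user_factors, item_factors, user, item):
    """Dot product of a user's and an item's latent factors."""
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


def factorize(ratings, n_users, n_items, n_factors=2, epochs=500,
              learning_rate=0.01, regularization=0.02, seed=42):
    """Factor the sparse ratings into user and item factor matrices."""
    rng = random.Random(seed)
    user_factors = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)]
                    for _ in range(n_users)]
    item_factors = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)]
                    for _ in range(n_items)]
    for _ in range(epochs):
        for user, item, rating in ratings:
            error = rating - predict(user_factors, item_factors, user, item)
            for f in range(n_factors):
                uf = user_factors[user][f]
                vf = item_factors[item][f]
                user_factors[user][f] += learning_rate * (error * vf - regularization * uf)
                item_factors[item][f] += learning_rate * (error * uf - regularization * vf)
    return user_factors, item_factors


def rmse(user_factors, item_factors, ratings):
    squared = [(r - predict(user_factors, item_factors, u, i)) ** 2
               for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# (userId, itemId, rating) triples; every other cell of the 4x3 matrix is missing.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]

untrained = factorize(ratings, n_users=4, n_items=3, epochs=0)
trained = factorize(ratings, n_users=4, n_items=3)

initial_error = rmse(*untrained, ratings)
trained_error = rmse(*trained, ratings)
missing_value = predict(*trained, 0, 2)  # user 0 never rated item 2
print(initial_error, trained_error, missing_value)
```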

spark.tc - Item-to-Item Collaborative Filtering

Made famous by Amazon Paper ~2003

Problem

As # of users grew, Matrix Factorization couldn’t scale

Solution

Offline/Batch

Generate itemId -> List[userId] vectors

Online/Real-time

For each item in cart, recommend similar items from vector space
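
A minimal sketch of the offline and online halves. The similarity measure (cosine over binary user sets) and all names and data below are assumptions for illustration; the slide does not specify them.

```python
import math
from collections import defaultdict


def build_item_vectors(interactions):
    """Offline/batch step: itemId -> set of userIds who interacted with it."""
    item_users = defaultdict(set)
    for user_id, item_id in interactions:
        item_users[item_id].add(user_id)
    return item_users


def item_similarity(users_a, users_b):
    """Cosine similarity between two binary user vectors stored as sets."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def similar_items(item_users, item_id, top_n=3):
    """Online step: rank the other items by similarity to the given item."""
    target = item_users[item_id]
    scores = [(other, item_similarity(target, users))
              for other, users in item_users.items() if other != item_id]
    scores = [(other, score) for other, score in scores if score > 0]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


interactions = [
    ("u1", "Top Gun"), ("u1", "Taps"),
    ("u2", "Top Gun"), ("u2", "Taps"), ("u2", "A Few Good Men"),
    ("u3", "A Few Good Men"), ("u3", "Top Gun"),
]
item_users = build_item_vectors(interactions)
recommendations = similar_items(item_users, "Taps")
print(recommendations)
```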

spark.tc - When to Approximate?

Memory or time constrained queries

Relative vs. exact counts are OK (# errors between then and now)

Using machine learning or graph algos

Inherently probabilistic and approximate

Finding topics in documents (LDA)

Finding similar pairs of users, items, words at scale (LSH)

Finding top influencers (PageRank)

Streaming aggregations (distinct count or top k)

Inherently sloppy means of collecting (at least once delivery)

Approximate as much as you can get away with!

Ask for forgiveness later!!

spark.tc - When NOT to Approximate?

If you’ve ever heard the term…

“Sarbanes-Oxley”

…in-that-order, at the office, after 2002.

spark.tc - A Few Good Algorithms

You can’t handle the approximate!

spark.tc - Common to These Algos & Data Structs

Low, fixed size in memory

Known error bounds

Store large amount of data

Less memory than Java/Scala collections

Tunable tradeoff between size and error

Rely on multiple hash functions or operations

Size of hash range defines error

spark.tc - Bloom Filter

Set.contains(key): Boolean

“Hash Multiple Times and Flip the Bits Wherever You Land”

spark.tc - Bloom Filter

Approximate set membership for key

False positive: expect contains(), actual !contains()

True negative: expect !contains(), actual !contains()

Elements are only added, never removed

spark.tc - Bloom Filter in Action

set(key)

contains(key): Boolean

Images by @avibryant

TRUE -> maybe contains

FALSE -> definitely does not contain.
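
A minimal Bloom filter sketch, "hash multiple times and flip the bits wherever you land". The class name, sizes, and the use of seeded SHA-256 as the family of hash functions are assumptions for illustration; production libraries use faster hashes and size the bit array from a target error rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several hash functions by prefixing the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def contains(self, key):
        # True  -> maybe contains (false positives are possible)
        # False -> definitely does not contain
        return all(self.bits[position] for position in self._positions(key))


viewed = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    viewed.add(user)

print(viewed.contains("user1001"), viewed.contains("user9999"))
```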

spark.tc - CountMin Sketch

Frequency Count and TopK

“Hash Multiple Times and Add 1 Wherever You Land”

spark.tc - CountMin Sketch (CMS)

Approximate frequency count and TopK for key

i.e., “Heavy Hitters” on Twitter

Matei Zaharia

Martin Odersky

Donald Trump

spark.tc - CountMin Sketch In Action (TopK, Count)

Use Case: TopK movies using total views

add(Top Gun, 2): x 2 occurrences of “Top Gun” for slightly additional complexity

Binary hash output (1 element per column)

Multiple hash functions (1 hash function per row)

add(A Few Good Men, 1)

Overlap Top Gun

add(Taps, 1)

Overlap A Few Good Men

getCount(Top Gun): Long

Find minimum of all rows

Can overestimate, but never underestimate

Images derived from @avibryant
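
A minimal CountMin Sketch in plain Python, "hash multiple times and add 1 wherever you land". The class name, table dimensions, and seeded SHA-256 hashing are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width  # columns per row
        self.depth = depth  # rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only ever inflate a cell, so the minimum across rows is
        # the tightest estimate: it can overestimate but never underestimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)

print(views.get_count("Top Gun"))
```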

spark.tc - HyperLogLog

Count Distinct

“Hash Multiple Times and Uniformly Distribute Where You Land”

spark.tc - HyperLogLog (HLL)

Approximate count distinct

Slight twist

Not many of these

Special hash function creates uniform distribution

Error estimate

14 bits for size of range

m = 2^14 = 16,384 hash slots

error = 1.04/(sqrt(16,384)) = .81%
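
The error arithmetic can be checked with a few lines. The function name `hll_standard_error` is hypothetical; the formula is the one on the slide, 1.04 / sqrt(m) with m = 2^bits.

```python
import math


def hll_standard_error(index_bits):
    """Relative standard error of HyperLogLog with 2**index_bits hash slots."""
    slots = 2 ** index_bits
    return 1.04 / math.sqrt(slots)


# More bits means more slots (more memory) and a smaller error.
for bits in (10, 12, 14):
    print(bits, 2 ** bits, f"{hll_standard_error(bits):.2%}")
```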

spark.tc - HyperLogLog In Action (Count Distinct)

Use Case: Number of distinct users who view a movie

Top Gun: Hour 1 (scale 0 to 16): users 7009, 1001, 2009, 3005, 3003, 3001

Uniform Distribution: Estimate distinct # of users by inspecting just the beginning

Top Gun: Hour 2 (scale 0 to 32): users 2001, 4009, 3002, 7002, 1005, 6001, 8001, 8002

Combine across different scales

Top Gun: Hour 1 + 2 (scale 0 to 32): users 7009, 2001, 4009, 1001, 3002, 2009, 7002, 1005, 3005, 6001, 3003, 8001, 8002, 3001
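
A simplified HyperLogLog sketch showing the two ideas in this use case: estimating distinct users per hour, then combining hours by merging. The class, the SHA-1 based 64-bit hash, and the synthetic user IDs are assumptions for illustration; production implementations (such as HyperLogLog++) add further bias corrections.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, index_bits=14):
        self.index_bits = index_bits
        self.num_slots = 1 << index_bits
        self.registers = [0] * self.num_slots

    def _hash64(self, key):
        digest = hashlib.sha1(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, key):
        value_bits = 64 - self.index_bits
        hashed = self._hash64(key)
        slot = hashed >> value_bits               # first index_bits pick the slot
        rest = hashed & ((1 << value_bits) - 1)   # remaining bits
        rank = value_bits - rest.bit_length() + 1 # position of the leftmost 1-bit
        if rank > self.registers[slot]:
            self.registers[slot] = rank

    def count(self):
        m = self.num_slots
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)  # small-range (linear counting) correction
        return raw

    def merge(self, other):
        """Union of two sketches: take the max of each register."""
        merged = HyperLogLog(self.index_bits)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


hour_1 = HyperLogLog()
hour_2 = HyperLogLog()
for i in range(12000):
    hour_1.add(f"user{i}")
for i in range(8000, 20000):  # 4000 users overlap with hour 1
    hour_2.add(f"user{i}")

both_hours = hour_1.merge(hour_2)
print(round(hour_1.count()), round(hour_2.count()), round(both_hours.count()))
```

Because merging is a per-register max, it is associative, so sketches from many hours or partitions can be combined in any grouping.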

spark.tc - Locality Sensitive Hashing

Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”

spark.tc - Locality Sensitive Hashing (LSH)

Approximate set similarity

Hash designed to cluster similar items

Avoids cartesian all-pairs comparison

Pre-process m rows into b buckets

b << m

Hash items multiple times

Similar items hash to overlapping buckets

Compare just contents of buckets

Much smaller cartesian … and parallel !!
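
One common way to build such buckets for set similarity is MinHash signatures split into bands. That technique, and every name and parameter below, is an assumption for illustration; the slides do not say which LSH family the demos use.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, value):
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """One minimum per seeded hash function; items must be non-empty."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def lsh_candidate_pairs(sets_by_key, bands=10, rows_per_band=2):
    """Hash each signature band into a bucket; only bucket-mates are compared."""
    buckets = defaultdict(set)
    for key, items in sets_by_key.items():
        signature = minhash_signature(items, bands * rows_per_band)
        for band in range(bands):
            start = band * rows_per_band
            chunk = tuple(signature[start:start + rows_per_band])
            buckets[(band, chunk)].add(key)
    candidates = set()
    for keys in buckets.values():
        for a, b in combinations(sorted(keys), 2):
            candidates.add((a, b))
    return candidates


movie_tags = {
    "Top Gun": {"action", "jets", "navy", "pilot"},
    "Top Gun (re-release)": {"action", "jets", "navy", "pilot"},
    "Taps": {"drama", "cadets", "academy"},
}
candidates = lsh_candidate_pairs(movie_tags)
print(candidates)
```

Only the candidate pairs would then go through an exact similarity check, which replaces the full cartesian comparison with many small per-bucket comparisons that can run in parallel.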

spark.tc - DIMSUM

Set Similarity

“Pre-process and ignore data that is unlikely to be similar.”

spark.tc - DIMSUM

“Dimension Independent Matrix Square Using MR”

Remove vectors with low probability of similarity

RowMatrix.columnSimilarities(threshold)

Twitter DIMSUM Case Study

40% efficiency gain over brute-force Cosine Sim

spark.tc - Common Tools to Approximate

Twitter Algebird

Composable Library

Redis

Distributed Cache

Apache Spark

Big Data Processing

spark.tc - Twitter Algebird

Rooted in Algebraic Fundamentals!

Parallel

Associative

Composable

Examples

Min, Max, Avg

BloomFilter (Set.contains(key))

HyperLogLog (Count Distinct)

CountMin Sketch (TopK Count)

spark.tc - Redis

Implementation of HyperLogLog (Count Distinct)

12KB per item count

2^64 max # of items

Tunable

0.81% error (Tunable)

Add user views for given movie

PFADD TopGun_HLL user1001 user2009 user3005

PFADD TopGun_HLL user3003 user1001

ignore duplicates

Get distinct count (cardinality) of set

PFCOUNT TopGun_HLL

Returns: 4 (distinct users viewed this movie)

Union 2 HyperLogLog Data Structures

PFMERGE TopGun_HLL Taps_HLL

spark.tc - Spark Approximations

Spark Core

RDD.count*Approx()

Spark SQL

PartialResult

approxCountDistinct(column), HyperLogLogPlus

Spark ML

Stratified sampling

PairRDD.sampleByKey(withReplacement: Boolean, fractions: Map[K, Double])

DIMSUM sampling

Probabilistic sampling reduces amount of comparison shuffle

RowMatrix.columnSimilarities(threshold)

Spark Streaming

A/B testing

StreamingTest.setTestMethod(“welch”).registerStream(dstream)

spark.tc - Demos!

spark.tc - Counting

Exact Count vs. Approx HyperLogLog, CountMin Sketch

spark.tc - HashSet vs. HyperLogLog (Memory)

spark.tc - HashSet vs. CountMin Sketch (Memory)

spark.tc - Set Similarity

Brute Force vs. Locality Sensitive Hashing Similarity

spark.tc - Brute Force Cartesian All Pair Similarity

47 seconds

spark.tc - Locality Sensitive Hash All Pair Similarity

6 seconds

spark.tc - http://advancedspark.com

Download Docker or Clone Github

Many More Demos!

spark.tc - Netflix Recommendation & Data Pipeline

From 5 Stars to Trending Now

spark.tc - Netflix Has a Lot of Data

Netflix has a lot of data about a lot of users and a lot of movies.

Netflix can use this data to buy new movies.

My favorite movie: “Harold and Kumar Go to White Castle”

Netflix is global.

The UK doesn’t have White Castle.

Renamed my favourite movie to: “Harold and Kumar Get the Munchies”

This broke my unit tests!

Netflix can use this data to choose original programming.

Netflix knows that a lot of people like politics and Kevin Spacey.

Summary: Buy NFLX Stock!

spark.tc - $1 Million Netflix Prize (2006-2009)

Goal

Improve movie predictions by 10% (RMSE)

Dataset

(userId, movieId, rating, timestamp)

Test data withheld to calculate RMSE upon submission

Winning algorithm

10.06% improvement (RMSE)

Ensemble of 500+ ML models combined with GBDTs

Computationally impractical
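
A short sketch of how RMSE and a percentage improvement are computed. The ratings and predictions below are made-up illustrative numbers, not Netflix Prize data, and the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement_pct(baseline_rmse, new_rmse):
    """Percent reduction in RMSE relative to a baseline."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


actual = [4, 3, 5, 2]                # withheld test ratings (illustrative)
baseline = [3, 3, 4, 2]              # predictions from an existing model
candidate = [3.5, 3, 4.5, 2]         # predictions from a new model

baseline_rmse = rmse(baseline, actual)
candidate_rmse = rmse(candidate, actual)
print(baseline_rmse, candidate_rmse, improvement_pct(baseline_rmse, candidate_rmse))
```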

spark.tc - Secrets to the Winning Algorithms

Adjust for the following human biases…

① Alice Effect: user rates lower than the average user

② Inception Effect: movie is rated higher than the average movie

③ Overall mean rating of a movie

④ Number of people who have rated a movie

⑤ Mood, time of day, day of week, season, weather

⑥ Number of days since user’s first rating

⑦ Number of days since movie’s first rating
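
The first three adjustments can be sketched as a baseline predictor: overall mean plus a per-user offset plus a per-movie offset. This is a simplified illustration under assumed names and made-up ratings, not the winning team's actual formula, which also modeled time and regularized the offsets.

```python
from collections import defaultdict


def baseline_predictor(ratings):
    """Fit overall mean, per-user bias, and per-movie bias from (user, movie, rating)."""
    overall_mean = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    by_movie = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append(rating - overall_mean)
        by_movie[movie].append(rating - overall_mean)
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}

    def predict(user, movie):
        # Unknown users or movies fall back to a bias of zero.
        return overall_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

    return overall_mean, user_bias, movie_bias, predict


ratings = [
    ("alice", "Inception", 4), ("alice", "Taps", 2),
    ("bob", "Inception", 5), ("bob", "Taps", 5),
]
overall_mean, user_bias, movie_bias, predict = baseline_predictor(ratings)

# alice rates lower than average; Inception is rated higher than average.
print(user_bias["alice"], movie_bias["Inception"], predict("alice", "Inception"))
```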

spark.tc - Netflix Data Pipeline - Then

v1.0!

v2.0!

spark.tc - Netflix Data Pipeline - Now

v3.0!

8 million events per second

spark.tc - Netflix Recommendation Pipeline

Throw away batch-generated user factors (U)

spark.tc - Netflix Common ML Algorithms

Logistic Regression

Linear Regression

Gradient Boosted Decision Trees

Random Forest

Matrix Factorization

SVD

Restricted Boltzmann Machines

Deep Neural Nets

Markov Models

LDA

Clustering

Ensembles

spark.tc - Netflix Trending Now

Time of day

Personalized to user (viewing history, past ratings)

Personalized to events (Valentine’s Day)

Number of Impressions

Number of Plays

Calculate Take Rate

“VHS”

spark.tc - Bonus: Pandora Time of Day Recs

Work Days

Play familiar music

User is less likely to accept new music

Evenings and Weekends

Play new music

User is more likely to accept new music

spark.tc - Netflix Social Integration

Post to Facebook after movie start (5 mins)

Recommend without needing viewing history

Helps with Cold Start problem

spark.tc - Netflix Search

No results? No problem… Show similar results!

Empty searches are good!

Explicit feedback for future recommendations

Content to buy and produce!

spark.tc - Bonus: Netflix in 2004

Netflix noticed people started to rate movies higher!?

Why?

Significant UI improvements made around that time

Recommendation improvements (Cinematch)

spark.tc - Thank You!!

Chris Fregly @cfregly

IBM Spark Tech Center

http://spark.tc

San Francisco, California, USA

http://advancedspark.com

Sign up for the Meetup and Book

Contribute to Github Repo

Run all Demos using Docker

Image derived from http://www.duchess-france.org/

Find me: LinkedIn, Twitter, Github, Email, Fax

spark.tc - IBM Spark

Power of data. Simplicity of design. Speed of innovation.

advancedspark.com

@cfregly
