This page reproduces the content of http://www.slideshare.net/SotaroSugimoto/spring-2016-intern-at-treasure-data (uploaded 2016/06/16).

Presentation on intern work: Field-Aware Factorization Machines, Kernelized Passive-Aggressive Classification, ChangeFinder (Anomaly Detection)

- 2016 Spring Intern

@ Treasure Data

2016/4/3 - 2016/6/17

Part 1: Field-Aware Factorization Machines

Part 2: Kernelized Passive-Aggressive

Part 3: ChangeFinder - whoami

Sotaro Sugimoto (杉本 宗太郎)

• U. Tokyo B.S. Physics (2016)

• Georgia Tech M.S. Computational Science & Engineering (2016-2018)

• https://github.com/L3Sota

Facebook (Look for the dog) - What will this talk be about?

• Model-based Predictors

• “Reading the future”

• Estimating the value of an important variable

• Determining whether or not some action will occur

• Statistical Anomaly Detection

• The computer monitors a resource and tells us when “something unnatural” happens - Part 1: Field-Aware Factorization Machines

• What we want to achieve

• SVM to FFM and everything in between

• What’s a Field?

• Pros and Cons - FFM: what we want to achieve

• Prediction: Data goes in, predictions come out

• CTR

Prediction result

• Shopping recommendations

Prediction function

Input vector

• Regression & Classification

• Regression: Results are real-valued (ŷ ∈ ℝ)

• Classification: Results are binary (ŷ ∈ {−1, +1} and ŷ ∈ {0, 1} are common) - Click-Through Rate (CTR) Prediction

• Will user X click my ad? What percentage of users will click my ad? -> Find the probability that a target of an ad will click through.

Input:

• User ID

• Past ads clicked

• Past conversions made

• Mouse movements

• Favorite websites

Output:

• Whether or not a click-through will occur by user X during a particular session

• Classification - Shopping Recommendations

• Will user X buy this product? What products would this user like to see next? -> Predict the rating that the user would give to unseen items.

Input:

• User ID

• Past items looked at

• Past items bought

• Past items rated

• Mouse movements

• Favorite product categories

Output:

• Expected ratings for each item (i.e. a list of recommended items when ordered by rating from highest to lowest)

• Regression

• This is not to say that you can’t make a similar classification problem - So that this…

What is this???

I AM NOT A

FATHER

No thanks… - Becomes this!

Very important. VERY.

Important.

Gifts for my girlfriend

AWESOME RECOMMENDATIONS

Dead trees! FABULOUS! - FM’s Roots

• FM is a generalized model.

The point of FM was to combine Linear Classification…

• Support Vector Machines (SVM)

…with Matrix-based Approaches.

• Singular Value Decomposition (SVD)

• Matrix Factorization (MF) - Support Vector Machines

• Classification

1. Find a plane splitting category 1 from category 2 (H₂, H₃)

2. Maximize the distance from both categories (H₃)

3. New data can be classified with this plane

Image from Wikipedia: https://commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes_(SVG).svg - Support Vector Machines

• Calculation specifics

• The plane is denoted by a vector w (the normal vector)

• The prediction function is given by f(x) = ⟨w, x⟩ + b.

• ⟨·, ·⟩ is the inner product.

• When using a kernel, the function becomes f(x) = Σ_i α_i y_i K(x_i, x) + b

• e.g. d-dimensional Polynomial Kernel: K(x, z) = (⟨x, z⟩ + c)^d

• New data can be classified with sign(f(x))

Image originally from Wikipedia, modified: https://commons.wikimedia.org/wiki/File:Normal_vectors2.svg - FFM’s Roots

• FM is a generalized model.

The point of FM was to combine Linear Classification…

• Support Vector Machines (SVM)

…with Matrix-based Approaches.

• Singular Value Decomposition (SVD)

• Matrix Factorization (MF) - Matrix-based approaches

The difference between SVD and MF (besides the diagonal matrix S) is that MF ignores zero entries in the matrix during factorization, which tends to improve performance.
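The zero-skipping behavior that distinguishes MF from SVD can be sketched as one SGD pass over only the observed entries. This is a minimal illustrative sketch, not Hivemall's implementation; the learning rate, regularization, and factor shapes are assumptions:

```python
import numpy as np

def mf_step(R, U, V, lr=0.01, reg=0.02):
    """One SGD pass of Matrix Factorization R ~ U @ V.T that, unlike SVD,
    fits only the observed (non-zero) entries of R."""
    for i, j in zip(*np.nonzero(R)):  # skip zero entries entirely
        err = R[i, j] - U[i] @ V[j]
        # Update both factor rows from their *old* values.
        U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                      V[j] + lr * (err * U[i] - reg * V[j]))
    return U, V
```

Repeating `mf_step` drives the reconstruction error down on the rated cells while leaving the unrated (zero) cells free to take whatever values the factors imply, which is exactly the property that makes MF useful for recommendation.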

Image from Qiita: http://qiita.com/wwacky/items/b402a1f3770bee2dd13c - Model comparison (Model | Interaction Order | Model Equation):

Model            Interaction Order
Linear Model     1
Poly2 Model      2 (pairwise)
SVM              1
Kernelized SVM   n
SVD              2
MF               2
FM               n
FFM              2 (n)

(The Model Equation column, with global-bias and single-item-weight terms annotated, was an image and is not reproduced here.) - Factorization Machines

• No easy geometric representation

• The prediction function is given by ŷ(x) = w_0 + Σ_i w_i x_i + Σ_i Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j.

• Interactions between components are implicitly modeled with factorized vectors

• For each feature x_i, define a vector v_i with k dimensions.

• ⟨v_i, v_j⟩ is used instead of w_ij. Recall Poly2 is ŷ(x) = w_0 + Σ_i w_i x_i + Σ_i Σ_{j>i} w_ij x_i x_j.

• But wait…

• This is O(kn²)! - Math!

(The slide showed the 5×5 pairwise weight matrix W = (w_{i,j}) being factorized as W = V Vᵀ, so that w_{i,j} = ⟨v_i, v_j⟩.) - Factorization Machines

• Substitute ⟨v_i, v_j⟩ in the previous calculations:

Works wonders on sparse data!

• Factorization allows implicit interaction modeling, i.e. we can infer interaction strengths from similar data

• Factorization vectors only depend on one data point so calculations are O(kn).

• In fact, with a sparse representation the complexity is O(km̄), where m̄ is the average number of non-zero components.
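The O(kn) evaluation of the pairwise term can be sketched as follows; a minimal NumPy sketch, with names chosen for illustration:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    evaluated in O(kn) instead of the naive O(kn^2).

    x: (n,) input, w0: global bias, w: (n,) linear weights,
    V: (n, k) factorization matrix (row i is v_i).
    """
    linear = w0 + w @ x
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]
    s = V.T @ x                  # (k,) one pass over the data
    s2 = (V ** 2).T @ (x ** 2)   # (k,) removes the i == j terms
    return float(linear + 0.5 * np.sum(s * s - s2))
```

With a sparse `x`, both matrix-vector products touch only the non-zero components, which is where the O(km̄) figure comes from.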

But wait…

• Not as useful for dense data (use SVM for dense data classifications) - Field-Aware Factorization Machines

• A more powerful FM

• The prediction function is given by ŷ(x) = w_0 + Σ_i w_i x_i + Σ_i Σ_{j>i} ⟨v_{i,f_j}, v_{j,f_i}⟩ x_i x_j.

• Wait, what changed?

• There is an additional subscript on v, known as the field.

• Note: The constant and linear terms remain the same. - Field-Aware Factorization Machines

These are

fields

These are

features - Field-Aware Factorization Machines

(cont.)

• ⟨v_{i,f_j}, v_{j,f_i}⟩ x_i x_j

• We specify a v based on the current feature x_i of the input vector and the field f_j of the other feature x_j.

• In other words, for each pair of features we can specify two vectors, one where we use the field of x_j (i.e. v_{i,f_j}), and another where we use the field of x_i (i.e. v_{j,f_i}). - Worked Example: 1 Data Point

• Sotaro went to see Zootopia!

• I haven’t actually seen Zootopia yet.

• Let’s guess what his rating will be. -> Regression

Field    Abbrev.   Feature    Abbrev.   Value
Users    u         L3Sota     s         1
Movies   m         Zootopia   z         1
Genre    g         Comedy     c         1
Genre    g         Drama      d         1
Price    pp        Price      p         1200

- Linear Model

• A single vector is sufficient to hold all the weights.
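For this data point the linear prediction is just w_0 plus a weighted sum over the non-zero features. A tiny sketch with made-up weights (every number here is illustrative, not learned):

```python
# Sparse representation of the worked example (feature abbrev. -> value).
x = {"s": 1.0, "z": 1.0, "c": 1.0, "d": 1.0, "p": 1200.0}

# Hypothetical weights; w0 is the global bias.
w0 = 3.0
w = {"s": 0.2, "z": 0.5, "c": 0.1, "d": -0.1, "p": -0.001}

def linear_predict(x, w0, w):
    """y_hat = w0 + sum_i w_i * x_i, iterating only over non-zero features."""
    return w0 + sum(w.get(f, 0.0) * v for f, v in x.items())
```

With these made-up weights the predicted rating comes out to 2.5. Note there is no term coupling, say, L3Sota with Comedy; that is exactly what the higher-order models below add.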

(Same feature table as above.)

- Poly2 Model

(Same feature table as above.)

- FM Model

(Same feature table as above.)

- FFM Model

(Same feature table as above.)

- Pros and Cons: FFM

• Pros

• Higher prediction accuracy (i.e. the model is more expressive than FM)

• Cons

• O(kn²) computation complexity (f: number of fields),

where f_j is the field of x_j and f_i is the field of x_i

• Can’t split the inner product into two independent sums! -> Double loop

• FM was O(kn).
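The double loop that the field-aware lookup forces can be sketched as follows; the shapes and names are assumptions for illustration:

```python
import numpy as np

def ffm_pairwise(x, fields, V):
    """FFM pairwise term sum_{i<j} <v[i, f_j], v[j, f_i]> * x_i * x_j.

    x:      (n,) input values
    fields: (n,) field index of each feature
    V:      (n, num_fields, k) one k-dim vector per (feature, field) pair

    The lookup couples i and j, so the sum cannot be split into
    independent per-feature sums: an O(k n^2) double loop remains.
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += float(V[i, fields[j]] @ V[j, fields[i]]) * x[i] * x[j]
    return total
```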

• Data structures need to understand the field of each component (feature) in the input vector. -> More memory consumption - Status of FFM within Hivemall

• Pull request merged (#284)

• https://github.com/myui/hivemall/pull/284

• Will probably be in next release(?)

• train_ffm(array<string> x, double y[, const string options])

• Trains the internal FFM model using a (sparse) vector x and target y.

• Training uses Stochastic Gradient Descent (SGD).

• ffm_predict(m.model_id, m.model, data.features)

• Calculates a prediction from the given FFM model and data vector.

• The internal FFM model is referenced as ffm_model m - Part 2: Kernelized Passive-Aggressive

• What we want to achieve

• Quite Similar to SVM

• Pros and Cons - KPA: What we want to achieve

• Prediction: Same as FFM

• Regression & Classification: Same as FFM

• Passive-Aggressive uses a linear model -> similar to Support Vector Machines - Classification

Quite Similar to SVM

• SVM Model is f(x) = sign(⟨w, x⟩ + b)

• Passive-Aggressive Model is f(x) = sign(⟨w, x⟩)

• Additionally, PA uses a margin m for classification and regression.

What’s the difference?

• Passive-Aggressive models don’t update their weights when a new data point is correctly classified / a new data point is within the regression range.

• PA is an online algorithm (real-time learning)

• SVM generally uses batch learning
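The passive/aggressive behavior just described can be sketched for the linear classification case; an illustrative sketch with labels in {+1, −1}:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pa_update(w, x, y):
    """One Passive-Aggressive step for classification.

    Passive: leave w alone when x already has margin >= 1.
    Aggressive: otherwise move w the minimum distance that restores the margin.
    """
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss with margin 1
    if loss == 0.0:
        return w                            # passive: no update
    tau = loss / dot(x, x)                  # closed-form step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

Each mistake is corrected in one closed-form step, which is what makes PA a natural online (real-time) learner.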

Images and equations from slides at http://ttic.uchicago.edu/~shai/ppt/PassiveAggressive.ppt - But That’s Regular Passive-Aggressive

• What’s Kernelized PA, then?

• Kernelization means instead of using ⟨w, x⟩, we introduce a kernel function K which increases the expressiveness of the algorithm, i.e. f(x) = sign(Σ_i α_i y_i K(x_i, x)).

• This is geometrically interpreted as mapping each data point into a corresponding point in a higher dimensional space.

• In our case we used a Polynomial Kernel (of degree d with constant c) which can be expressed as follows:

• K(x, z) = (⟨x, z⟩ + c)^d

• E.g. when d = 2, K(x, z) = ⟨x, z⟩² + 2c⟨x, z⟩ + c²

• This gives us a model of higher degree, i.e. a model that has interactions between features!

• Note: The same methods can be used to make a Kernelized SVM too!

• Note: The same methods can be used to make a Kernelized SVM too! - Re

R gr

g e

r ssion?

Mo

M del

Ca

C t

a e

t go

g rie

ri s Mo

M del Eq

E uati

a o

ti n

Or

O d

r er

Global Bias Item/User Bias

Line

n ar

a Mod

M el

N

1

1

Pairwise

Pol

P y

ol 2

y Mod

o el

Y

2

1

SV

S M

V

N

1

1

Ke

K rn

r elize

z d SVM

S

N

n

1

SV

S D

V

Y

2

2

MF

Y

2

2

FM

Y

n

n

FFM

Y

2 (n

( )

n - Visualization
- Pros and Cons: KPA

• Pros

• A higher order model generally means better classification/regression results

• Cons

• A Polynomial Kernel of degree d generally has a computational complexity of O(n^d)

• However, this can be avoided, especially where input is sparse! - Status of Kernelized Passive-Aggressive in Hivemall

• KPA for classification is complete

• Also includes modified PA algorithms PA-I and PA-II in kernelized form

• i.e. KPA-I, KPA-II

• No pull request yet

• https://github.com/L3Sota/hivemall/tree/feature/kernelized_pa

• Didn’t get around to writing the pull request

• Code has been reviewed.

• Includes options for faster processing of the kernel, such as Kernel Expansion and Polynomial Kernel with Inverted Indices (PKI)

• Don’t ask me why it’s not called PKII - Part 3: ChangeFinder

• What we want to achieve

• How ChangeFinder Works

• What ChangeFinder can and can’t do - Take this…
- …and do this!
- ChangeFinder: what we want to achieve

• Anomaly/Change-Point Detection: Data goes in, anomalies come out

• What’s the difference? -> Lone outliers are detected as anomalies and long-lasting/permanent changes in behavior are detected as change-points.

• Anomalies: Performance statistics (98th percentile response time, CPU usage) go in; momentary dips in performance (anomalies) may be signs of network or processing bottlenecks.

• Change-Points: Activity (port 135 traffic, SYN requests, credit card usage) goes in; explosive increases in activity (change-points) may be signs of an attack (virus, flood, identity theft). - How ChangeFinder Works

• Anomaly Detection:

1. We assume the data follows a pattern and attempt to model it.

2. The current model gives a probability distribution for the next data point, i.e. the probability that x_{t+1} is x.

3. Once the next datum arrives, we can calculate a score from the probability distribution

4. If the score is greater than a preset threshold, an anomaly has been detected. - How ChangeFinder Works

• Change-Point Detection:

1. We assume the running mean of the anomaly scores follows a pattern and attempt to model it.

2. The current model gives a probability distribution for the next score, i.e. the probability that y_{t+1} is y.

3. Once the next datum arrives, we can calculate a score from the probability distribution

4. If the score is greater than a preset threshold, a change-point has been detected. - How ChangeFinder Works

1.

• We assume a k-degree Autoregressive model AR(k):

x_t = Σ_i A_i (x_{t−i} − μ) + μ + ε

• μ: The average of the model

• A_i: The model matrices, which determine how previous data affects the next data point

• ε: A normally distributed error term following N(0, Σ)

AR model example graphs obtained from http://paulbourke.net/miscellaneous/ar/ - How ChangeFinder Works
AR model example graphs obtained from http://paulbourke.net/miscel aneous/ar/ - How ChangeFinder Works

2.

• Given the parameters of the model, we calculate an estimate for the next data point:

x̂_t = Σ_i Â_i (x_{t−i} − μ̂) + μ̂

• Hats denote “statistically estimated value”

3. We then receive a new input x_t, and calculate the estimation error x_t − x̂_t. Assuming the model parameters are (mostly) correct, this expression evaluates to ε, which we know is distributed according to N(0, Σ̂). - How ChangeFinder Works

4.

• We can therefore calculate the score as Score(x_t) = −log p(x_t | x_1, …, x_{t−1})

• Our estimate of the model is never perfect, so we should update the model parameters each time a new data point comes in!

• We also need to update the model parameters whenever we encounter a change-point, since the series has completely changed behavior.

5. After calculating the score for x_t, we assume that x_t follows the same time series and update our model parameter estimates - What ChangeFinder can and can’t do
time series and update our model parameter estimates - What ChangeFinder can and can’t do

• ChangeFinder can detect anomalies and change-points.

• ChangeFinder can adapt to slowly changing data without sending false positives.

• ChangeFinder can be adjusted to be more/less sensitive.

• Window size, Forgetfulness, Detection Threshold

• ChangeFinder can’t distinguish an infinitely large anomaly from a change-point.

• ChangeFinder can’t detect small change-points.

• ChangeFinder can’t correctly detect anything at the beginning of the dataset. - Too many false positives! - Status of ChangeFinder within Hivemall

• No pull request yet

• https://github.com/L3Sota/hivemall/tree/feature/cf_sdar_focused

• Mostly complete but some issues remain with detection accuracy, esp. at higher dimensions

• cf_detect(array<double> x[, const string options])

• ChangeFinder expects input one data point (one vector) at a time, and automatically learns from the data in the order provided while returning detection results. - How was Interning?

• Educational

• Eclipse

• Maven

• Java

• Contributing to an existing project

• Inspiring

• Cool people doing cool stuff, and I get to join in

• Critical

• Next steps: Code more! Get more experience!

• Shifting from “doing what I’m told” to “thinking about what the next step is”