This page reproduces the content of http://www.slideshare.net/WushWu/online-advertising-and-large-scale-model-fitting .


- Online Advertising and Large-scale Model Fitting

Wush Wu

2014-10-24

- Outline

● Introduction of Online Advertising
● Handling Real Data
  – Data Engineering
  – Model Matrix
  – Enhance Computation Speed of R
● Fitting Models to Large-scale Data
  – Batch Algorithm – Parallelizing an Existing Algorithm
  – Online Algorithm – SGD, FTPRL, and Learning Rate Schemes
● Display Advertising Challenge

- Ad Formats – Pre-Roll Video Ads
- Ad Formats – Banner / Display Ads
- AdWords Search Ads
- Related Content Ads
- Online Advertising is Growing Rapidly
- Why is Online Advertising Growing?

● Wide reach
● Highly informative
● Target oriented
● Cost-effective
● Quick conversion
● Easy to use
● Measurable

“Half the money I spend on advertising is wasted; the trouble is I don't know which half.”

- How do we measure the online ad?

● The user behavior on the internet is trackable.
  – We know who watches the ad.
  – We know who buys the product.
● We collect data for measurement.

- How do we collect the data?
- Performance-based advertising

● Pricing Model
  – Cost-Per-Mille (CPM)
  – Cost-Per-Click (CPC)
  – Cost-Per-Action (CPA) or Cost-Per-Order (CPO)

- To Improve Profit

● Display the ad with high Click-Through Rate (CTR) × CPC, or Conversion Rate (CVR) × CPO
● Estimating the probability of a click (conversion) is the central problem
  – Rule Based
  – Statistical Modeling (Machine Learning)

- System

[System diagram – online path: Website → Ad Request → Online Recommendation → Ad Delivering → Website; batch path: Log Server → Model Fitting]

- Rule Based

● Let the advertiser select the target group

X

- Statistical Modeling

● We log the display and collect the response
● Features
  – Ad
  – Channel
  – User

- Features of Ad

● Ad type
● Ad Content
  – Text
  – Fashion
  – Figure
  – Health
  – Video
  – Game

- Features of Channel

● Visibility

- Features of User

● Sex
● Age
● Location
● Behavior

- Real Features

Zhang, Weinan; Yuan, Shuai; Wang, Jun; Shen, Xuehua. Real-Time Bidding Benchmarking with iPinYou Dataset.

- Know How v.s. Know Why

● We usually do not study the reason for a high CTR
● A small improvement in accuracy implies a large improvement in profit
● Predictive Analysis

- Data

● School
  – Static
  – Cleaned
  – Public
● Commercial
  – Dynamic
  – Error
  – Private

- Data Engineering

Impression + Click:

CLICK_TIME       CLIENT_IP     CLICKED  ADID
2014/05/17 ...   2.17.x.x               133594
2014/05/17 ...   140.112.x.x            134811

- Data Engineering with R

http://wush978.github.io/REngineering/

● Automation of R Jobs
  – Convert an R script into a command-line application
  – Learn modern tools such as Jenkins
● Connections between multiple machines
  – Learn ssh
● Logging
  – Linux tools: bash redirection, tee
  – R package: logging
● R Error Handling
  – try, tryCatch

- Characteristic of Data
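Before moving on, the try/tryCatch error handling listed above can be sketched as follows; a minimal example in which the failing job and the fallback value are made up:

```r
# Minimal sketch of tryCatch in a batch R job; the failing step is simulated.
fetch_logs <- function() stop("log server unreachable")

result <- tryCatch(
  fetch_logs(),
  error = function(e) {
    message("job failed: ", conditionMessage(e))  # record the error
    NULL                                          # fall back to a safe value
  },
  finally = message("cleanup always runs")        # runs whether or not we failed
)
is.null(result)  # TRUE: the error was caught instead of killing the job
```

`try` is the lighter-weight variant: it returns an object of class `try-error` instead of dispatching to a handler.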

● Rare Event
● Large Amount of Categorical Features
  – Binning Numerical Features
● Features are highly correlated
● Some features occur frequently, some occur rarely

- Common Statistical Models for CTR

● Logistic Regression
● Gradient Boosted Regression Tree
  – Check xgboost

- Logistic Regression

  P(Click | x) = 1 / (1 + e^{−wᵀx}) = σ(wᵀx)

● Linear relationship with the features
  – Fast prediction
  – (Relatively) fast fitting
● Usually fit the model with L2 regularization

- How large is the data?

● Instances: 10^9
● Binary features: 10^5

- Subsampling

● Sampling is useful for:
  – Data exploration
  – Code testing
● Sampling might harm the accuracy (profit)
  – Rare event
  – Some features occur frequently and some occur rarely
● We do not subsample data so far

- Sampling

● Olivier Chapelle, et al. Simple and scalable response prediction for display advertising.

- Computation

  P(Click | x) = 1 / (1 + e^{−wᵀx})

  wᵀx

- Model Matrix

head(model.matrix(Species ~ ., iris))

- Dense Matrix

● 10^9 instances
● 10^5 binary features
● 10^14 elements in the model matrix
● Size: 4 × 10^14 bytes
  – 400 TB
● In memory is about 10^3 times faster than on disk

- R and Large Scale Data

● R cannot handle large-scale data
● R consumes lots of memory

- Sparse Matrix

A ∈ ℝ^{m×n} with k nonzero elements

Dense matrix:

  [ 1 0 1 0
    0 0 0 0
    0 0 0 0
    0 0 1 0 ]   requires 4mn size

List of triplets:

  (1,1,1), (1,3,1), (4,3,1)   requires 12k size

Compressed list:

  by row:    i: {1,3,3}, p: {2,0,0,1}, x: {1,1,1}   requires 8k + 4m size
  by column: j: {1,1,4}, p: {1,0,2,0}, x: {1,1,1}   requires 8k + 4n size

- Sparse Matrix
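The compressed layout above is exactly what R's Matrix package stores in its dgCMatrix class; a small check with the same 4×4 example (note the stored slots are 0-based):

```r
library(Matrix)
# Same 4x4 example as above: nonzeros at (1,1), (1,3), (4,3)
A <- sparseMatrix(i = c(1, 1, 4), j = c(1, 3, 3), x = c(1, 1, 1), dims = c(4, 4))
class(A)  # "dgCMatrix": compressed sparse column format
A@i       # 0-based row indices, column by column: 0 0 3
A@p       # column pointers (cumulative nonzero counts): 0 1 1 3 3
A@x       # nonzero values: 1 1 1
```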

● The number of non-zeros can be estimated from the number of categorical variables

  m ∼ 10^9
  n ∼ 10^5
  k ∼ 10^1 × 10^9

  Dense matrix: 4 × 10^14
  List: 12 × 10^9
  Compressed: 12 × 10^9 or 8 × 10^9 + 4 × 10^5

- Sparse Matrix

● Sparse matrices are useful for:
  – Large amounts of categorical data
  – Text Analysis
  – Tag Analysis

- R package: Matrix
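For the categorical data mentioned above, the Matrix package can also build the design matrix directly in sparse form via sparse.model.matrix; a minimal sketch with made-up click-log data:

```r
library(Matrix)
# Hypothetical click log with two categorical features
df <- data.frame(
  clicked = c(0, 1, 0, 1),
  ad      = factor(c("A", "B", "A", "B")),
  channel = factor(c("news", "video", "news", "game"))
)
# One dummy column per non-baseline level, stored sparsely
X <- sparse.model.matrix(clicked ~ ad + channel, df)
class(X)  # "dgCMatrix"
dim(X)    # 4 x 4: intercept, adB, channelnews, channelvideo
```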

m1 <- matrix(0, 5, 5); m1[1, 4] <- 1
m1

library(Matrix)
m2 <- Matrix(0, 5, 5, sparse=TRUE)
m2[1, 4] <- 1
m2

- Computation Speed

m1 <- matrix(0, 5, 5); m1[1, 4] <- 1

library(Matrix)
m2 <- Matrix(0, 5, 5, sparse=TRUE)
m2[1, 4] <- 1

- Advanced tips: package Rcpp

● C/C++ uses memory more efficiently
● Rcpp provides an easy interface between R and C/C++

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP XTv(S4 m, NumericVector v, NumericVector& retval) {
  //...
}

- Two approaches of fitting logistic regression to large-scale data
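One way to complete the XTv sketch from the Rcpp tip above; this is my assumption of the intended routine (Xᵀv for a dgCMatrix), returning the result instead of filling retval:

```r
library(Rcpp)
library(Matrix)

# Assumed completion of XTv: X^T v for a column-compressed sparse matrix.
cppFunction('
NumericVector XTv(S4 m, NumericVector v) {
  IntegerVector i = m.slot("i");      // 0-based row indices
  IntegerVector p = m.slot("p");      // column pointers
  NumericVector x = m.slot("x");      // nonzero values
  IntegerVector dim = m.slot("Dim");
  NumericVector retval(dim[1]);
  for (int col = 0; col < dim[1]; ++col)
    for (int k = p[col]; k < p[col + 1]; ++k)
      retval[col] += x[k] * v[i[k]];  // accumulate column dot products
  return retval;
}')

X <- sparseMatrix(i = c(1, 1, 4), j = c(1, 3, 3), x = c(1, 2, 3), dims = c(4, 4))
v <- c(1, 2, 3, 4)
all.equal(XTv(X, v), as.vector(t(X) %*% v))  # TRUE
```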

● Batch Algorithm
  – Optimize the log likelihood globally
● Online Algorithm
  – Optimize the loss function instance by instance

- Batch Algorithm

Negative log-likelihood:

  f(w | (x_1, y_1), ⋯, (x_m, y_m)) = Σ_{t=1}^{m} −y_t log(σ(wᵀx_t)) − (1 − y_t) log(1 − σ(wᵀx_t))

Gradient descent:

  w_{t+1} = w_t − η ∇f(w_t)

Each update requires scanning all the data.

- Parallelizing an Existing Batch Algorithm
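Before parallelizing it, the serial gradient-descent update above can be sketched as follows; the data is simulated and the step size is an assumption:

```r
set.seed(1)
# Simulated data (hypothetical): 1000 instances, 5 features
X <- matrix(rnorm(5000), 1000, 5)
w_true <- c(1, -1, 0.5, 0, 2)
y <- rbinom(1000, 1, plogis(X %*% w_true))

sigma <- plogis
# Negative log-likelihood and its gradient
f <- function(w) sum(-y * log(sigma(X %*% w)) - (1 - y) * log(1 - sigma(X %*% w)))
g <- function(w) as.vector(t(X) %*% (sigma(X %*% w) - y))

w <- rep(0, 5)
loss0 <- f(w)
eta <- 1e-3                            # assumed step size
for (t in 1:200) w <- w - eta * g(w)   # each step scans all the data
f(w) < loss0                           # TRUE: the loss decreased
```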

Row-wise partition:

  ( X_1 ) v = ( X_1 v )
  ( X_2 )     ( X_2 v )

  ( v_1ᵀ  v_2ᵀ ) ( X_1 ) = v_1ᵀ X_1 + v_2ᵀ X_2
                 ( X_2 )

● We can split the data by instances across several machines
● The matrix–vector multiplication can be parallelized

- Framework of Parallelization

● Hadoop
  – Slow for iterative algorithms
  – Fault tolerance
  – Good for many machines
● MPI
  – If in memory, fast for iterative algorithms
  – No fault tolerance
  – Good for several machines

- R Package: pbdMPI

● Easy to install (on Ubuntu)
  – sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev
  – install.packages("pbdMPI")
● Easy to develop with (compared to Rmpi)

- R Package: pbdMPI

library(pbdMPI)
init()                                       # initialize the MPI communicator
.rank <- comm.rank()                         # rank (id) of this process
filename <- sprintf("%d.csv", .rank)         # each rank reads its own file
data <- read.csv(filename)
target <- reduce(sum(data$value), op="sum")  # sum the local sums across ranks
finalize()

- Parallelize Algorithm with pbdMPI

● Implement the functions required for optimization with pbdMPI
  – optim requires f and g (the gradient of f)
  – nlminb requires f, g, and H (the Hessian of f)
  – tron requires f, g, and Hs (H multiplied by a given vector s)

- Some Tips of Optimization
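The f and g that optim needs can be written so that each rank computes a local sum over its own shard; a serial sketch with comments marking where the pbdMPI allreduce would go (the shard is simulated):

```r
set.seed(2)
# Simulated local shard (hypothetical); with pbdMPI each rank loads its own shard.
X_local <- matrix(rnorm(2000), 500, 4)
y_local <- rbinom(500, 1, plogis(X_local %*% c(1, -1, 0, 0.5)))

f <- function(w) {
  z <- as.vector(X_local %*% w)
  # per-instance NLL: -y*log(sigma(z)) - (1-y)*log(1-sigma(z)) = log1p(exp(-z)) + (1-y)*z
  local_f <- sum(log1p(exp(-z)) + (1 - y_local) * z)
  local_f        # pbdMPI version: allreduce(local_f, op = "sum")
}
g <- function(w) {
  z <- as.vector(X_local %*% w)
  local_g <- as.vector(t(X_local) %*% (plogis(z) - y_local))
  local_g        # pbdMPI version: allreduce(local_g, op = "sum")
}

fit <- optim(rep(0, 4), fn = f, gr = g, method = "BFGS")
```

Every rank calls optim on the same initial value; because the reduced f and g are identical on all ranks, the iterates stay in lockstep.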

● Take care of the stopping criteria
  – A relative threshold might be enough
● Save the coefficients during iteration and print the values of f and g with the operator <<-
  – You can stop the iteration at any time
  – Monitor the convergence

- Overview
- LinkedIn Way

Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013

● Too much data to fit on a single machine
  – Billions of observations, millions of features
● A Naive Approach
  – Partition the data and run logistic regression for each partition
  – Take the mean of the learned coefficients
  – Problem: not guaranteed to converge to the model from a single machine!
● Alternating Direction Method of Multipliers (ADMM)
  – Boyd et al. 2011 (based on earlier work from the 70s)

- ADMM

For each node, the data and the coefficients are different:

  min Σ_{k=1}^{K} f_k(w_k) + (λ_2 / 2) ‖w‖_2²   subject to w_k = w ∀k

ADMM updates:

  w_{t+1}^k = argmin_{w_k} f_k(w_k) + (ρ/2) ‖w_k − w_t + u_t^k‖_2²

  w_{t+1} = argmin_w (λ_2 / 2) ‖w‖_2² + (ρ/2) Σ_{k=1}^{K} ‖w_{t+1}^k − w + u_t^k‖_2²

  u_{t+1}^k = u_t^k + w_{t+1}^k − w_{t+1}

- Update Coefficient

Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013

[Diagram: BIG DATA is split into Partition 1, Partition 2, Partition 3, …, Partition K; logistic regression runs on each partition; a consensus computation combines the results]

- Update Regularization

Deepak Agarwal. Computational Advertising: The LinkedIn Way. CIKM 2013

[Diagram: the same pipeline – BIG DATA split into Partition 1 … Partition K, logistic regression per partition, consensus computation – shown for the regularization update]

- Our Remark of ADMM

● ADMM saves communication between the nodes
● In our environment, the overhead of communication is affordable
  – ADMM does not enhance the performance of our system

- Online Algorithm

Stochastic Gradient Descent (SGD):

  f(w | y_t, x_t) = −y_t log(σ(wᵀx_t)) − (1 − y_t) log(1 − σ(wᵀx_t))

  w_{t+1} = w_t − η ∇f(w_t | y_t, x_t)

● Choose an initial value and a learning rate
● Randomly shuffle the instances in the training set
● Scan the data and update the coefficients
  – Repeat until an approximate minimum is obtained

- SGD to Follow The Proximal Regularized Leader

H. Brendan McMahan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. AISTATS 2011

  w_{t+1} = w_t − η_t ∇f(w_t | y_t, x_t)
          = argmin_w ∇f(w_t | y_t, x_t)ᵀ w + (1 / 2η_t) (w − w_t)ᵀ(w − w_t)

Let g_t = ∇f(w_t | y_t, x_t) and g_{1:t} = Σ_{i=1}^{t} ∇f(w_i | y_i, x_i)

  w_{t+1} = argmin_w g_{1:t}ᵀ w + t λ_1 ‖w‖_1 + (λ_2 / 2) Σ_{i=1}^{t} ‖w − w_i‖_2²

- Regret of SGD

H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010

  Regret := Σ_{t=1}^{T} f_t(w_t) − min_w Σ_{t=1}^{T} f_t(w)

A global learning rate achieves the regret bound O(D M √T), where
  D is the L_2 diameter of the feasible set
  M is the L_2 bound of g

- Regret of SGD

H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010

Per-coordinate learning rate:

  η_{t,i} = α / (β + √(Σ_{s=1}^{t} g_{s,i}²))

achieves the regret bound O(√T · n^{1 − γ/2}), where n is the dimension of w.
If w ∈ [−0.5, 0.5]^n, then D = √n, and P(x_{t,i} = 1) ∼ i^{−γ} for some γ ∈ [1, 2).

- Comparison of Learning Rate Schemes

Xinran He, et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.

- Google KDD 2013, FTPRL

H. Brendan McMahan, et al. Ad Click Prediction: a View from the Trenches. KDD 2013.

- Some Remarks on FTPRL

● FTPRL is a general optimization framework.
  – We used it successfully to fit a neural network
● The per-coordinate learning rate greatly improves the convergence on our data
  – SGD works with a per-coordinate learning rate
● The “Proximal” part decreases the accuracy, but introduces sparsity

- Implementation of FTPRL in R
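As a starting point for such an implementation, here is a plain-R loop for SGD with the per-coordinate learning rate described earlier; this is only the adaptive rate, not the full FTPRL update with its proximal and L1 terms, and the data is simulated:

```r
set.seed(3)
# Simulated stream (hypothetical): binary features, logistic response
X <- matrix(rbinom(3000, 1, 0.3), 1000, 3)
y <- rbinom(1000, 1, plogis(X %*% c(2, -2, 1)))

# Per-coordinate rate: eta_{t,i} = alpha / (beta + sqrt(sum_s g_{s,i}^2))
alpha <- 0.1; beta <- 1
w <- rep(0, 3)
G <- rep(0, 3)                         # running sum of squared gradients
for (t in 1:1000) {
  x_t <- X[t, ]
  p <- plogis(sum(w * x_t))            # predicted click probability
  g <- (p - y[t]) * x_t                # gradient of the per-instance loss
  G <- G + g^2
  w <- w - alpha / (beta + sqrt(G)) * g
}
```

Coordinates that fire often accumulate a large G and get small steps; rare coordinates keep learning at nearly the initial rate.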

● I am not aware of any implementation of online optimization in R
● The algorithm is simple. Just write it with a for loop.
● The overhead of a loop is small in C/C++ compared to R
● I implemented the algorithm in https://github.com/wush978/BridgewellML/tree/r-pkg
  – Call for users
  – Contact me if you want to try it

- FTPRL v.s. TRON
- Batch v.s. Online

Olivier Chapelle, et al. Simple and scalable response prediction for display advertising.

● Batch Algorithm
  – Optimize the likelihood function to a high accuracy once they are in a good neighborhood of the optimal solution
  – Quite slow in reaching the solution
  – Straightforward to generalize batch learning to a distributed environment
● Online Algorithm (mini-batch)
  – Optimize the likelihood to a rough precision quite fast
  – A handful of passes over the data
  – Tricky to parallelize

- Criteo Inc. Hybrid of Online and Batch

● For each node, make one online pass over its local data using adaptive gradient updates.
● Average these local weights to form the initial value for L-BFGS.

- Facebook

Xinran He, et al. Practical Lessons from Predicting Clicks on Ads at Facebook. ADKDD 2014.

● Decision Tree (Batch) for Feature Transforms
● Logistic Regression (Online)

- Data Size and Accuracy
- Experiment Designs
- Experiment Results
- Experiment Analysis
- Improving

[Cycle: New Models / New Algorithms / New Features → Experiments → Analysis → repeat]

- Display Advertising Challenge

● https://www.kaggle.com/c/criteo-display-ad-challenge
● 7 × 10^7 instances
● 13 integer features and 26 categorical features with about 3 × 10^7 levels
● We were 9th out of 718 teams
  – We fit a neural network (2-layer logistic regression) to the data with FTPRL and dropout

- Dropout in SGD

Geoffrey E. Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors. CoRR 2012

- Tools of Large-scale Model Fitting
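The dropout just cited can be sketched as a random mask applied inside each SGD step; the rate and the inverted-dropout rescaling convention below are my assumptions:

```r
set.seed(4)
# Inverted dropout: zero out random coordinates and rescale the survivors,
# so the expected value of the layer input is unchanged.
dropout <- function(x, rate = 0.5) {
  keep <- rbinom(length(x), 1, 1 - rate)  # 1 = keep the coordinate
  x * keep / (1 - rate)
}

h <- rep(1, 10)
dropout(h, rate = 0)                      # rate 0 keeps everything: all ones
mean(replicate(2000, mean(dropout(h))))   # close to 1 on average
```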

● Almost all top-10 competitors implemented the algorithms themselves
  – There is no dominant tool for large-scale model fitting
● The winner used only 20 GB of memory. See https://github.com/guestwalk/kaggle-2014-criteo
● For a single machine, there are some good machine learning libraries
  – LIBLINEAR for linear models (the student in the lab is no. 1)
  – xgboost for gradient boosted regression trees (the author is no. 12)
  – Vowpal Wabbit

- Thanks for listening