This page reproduces the content of http://www.slideshare.net/OlivierChapelle/wsdm14.


Uploaded on 2014/03/05 in Technology.

Invited talk at WSDM 2014.

Abstract here: http://www.wsdm-conference.org/2014/practice-and-experience-talks/

Corresponding paper: http://olivier.chapelle.cc/pub/ngdstone.pdf

- RESPONSE PREDICTION FOR DISPLAY ADVERTISING

WSDM ’14

Olivier Chapelle - OUTLINE

1. Criteo & Display advertising

2. Modeling

3. Large scale learning

4. Explore / exploit

2 - DISPLAY ADVERTISING

Display Ad

3 - DISPLAY ADVERTISING

• Rapidly growing multi-billion dollar business (30% of internet advertising revenue in 2013).

• Marketplace between:

– Publishers: sell display opportunities

– Advertisers: pay for showing their ad

• Real Time Bidding:

– Auction amongst advertisers is held at the moment when a user generates

a display opportunity by visiting a publisher’s web page.

4 - BRANDING VS PERFORMANCE

PRICING TYPE

• CPM (Cost Per Mille): advertiser pays per thousand impressions

• CPC (Cost Per Click): advertiser pays only when the user clicks

• CPA (Cost Per Action): advertiser pays only when the user performs a

predefined action such as a purchase.

CAMPAIGN TYPE

• Branding: CPM

• Performance-based advertising (retargeting): CPC, CPA

CONVERSIONS

eCPM = CPC * predicted clickthrough rate

eCPM = CPA * predicted conversion rate
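The two conversion formulas above can be sketched in a few lines. This is only an illustration: the function names and the prices and rates are hypothetical, and the factor of 1000 in `ecpm` is an assumption based on the literal "per mille" meaning of CPM (the slide writes the product without it).

```python
def expected_value_per_impression(price, predicted_rate):
    """Expected payment for one impression: price per event times event probability."""
    if not 0.0 <= predicted_rate <= 1.0:
        raise ValueError("predicted_rate must be a probability")
    return price * predicted_rate


def ecpm(price, predicted_rate):
    """Same quantity expressed per thousand impressions (assumed scaling)."""
    return 1000.0 * expected_value_per_impression(price, predicted_rate)


# Hypothetical numbers, chosen only for illustration.
cpc_value = expected_value_per_impression(price=0.50, predicted_rate=0.002)
cpa_value = expected_value_per_impression(price=20.0, predicted_rate=0.0001)
```

Either way, the bid for a CPC or CPA campaign scales linearly with the predicted rate, which is why prediction accuracy matters so much.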

5 - CRITEO

[Diagram: Criteo bills advertisers on CPC and pays publishers on CPM.]

Criteo’s success depends highly on how precisely we can predict Click Through Rates (CTR) and Conversion Rates (CR)

6 - RECOMMENDATION

Collaborative filtering

Propose related items to a user based on historical interactions from other users

Over 50% of Criteo-driven sales come from recommended products the user had never viewed on advertiser websites

7 - CRITEO NUMBERS

[Infographic: 750 employees; 42 countries; 7,500-core cluster; 10 PB; $600 million 2013 revenue; 250K ad calls per second. Two further figures, 18 billion and 2.1 billion, have lost their labels.]

8 - OUTLINE

1. Criteo & Display advertising

2. Modeling

3. Large scale learning

4. Explore / exploit

9 - FEATURES

• Three sources of features: user, ad, page

• In this talk: categorical features on ad and page.

Publisher hierarchy: Publisher network > Publisher > Site > Url

Advertiser hierarchy: Advertiser network > Advertiser > Campaign > Ad

10 - HASHING TRICK

• Standard representation of categorical features: “one-hot” encoding

For instance, the site feature (one position per site, e.g. cnn.com, news.yahoo.com):

0 0 1 0 0 0 0

• Dimensionality equal to the number of different values

– can be very large

• Hashing to reduce dimensionality (made popular by John Langford in VW)

• Dimensionality now independent of number of values
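A minimal sketch of the hashing trick described above, assuming `zlib.crc32` as a stand-in hash function (VW uses its own hash; this is not its implementation). The function name, the `num_bits` parameter and the example values are hypothetical.

```python
import zlib


def hash_features(features, num_bits=18):
    """Map categorical (name, value) pairs to indices in a vector of size 2**num_bits."""
    dim = 1 << num_bits
    indices = []
    for name, value in sorted(features.items()):
        token = f"{name}={value}".encode("utf-8")
        # The index depends only on the token, so no dictionary of values is needed.
        indices.append(zlib.crc32(token) % dim)
    return indices


example = {"site": "cnn.com", "advertiser": "some-advertiser"}
active = hash_features(example, num_bits=18)
```

The vector size is fixed by `num_bits` alone; distinct values may collide on the same index, which is the price paid for the bounded dimensionality.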

11 - HASHING VS FEATURE SELECTION

• “Small” problem with 35M different values.

• Methods that require a dictionary have a larger model.

12 - QUADRATIC FEATURES

• Outer product between two features.

• Example: between site and advertiser,

Feature is 1 if and only if site=finance.yahoo.com & advertiser=bank of america

[Diagram: publisher and advertiser hierarchies, as on the Features slide.]

Similar to a polynomial kernel of degree 2

Large number of values: use the hashing trick
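As a rough sketch of how quadratic features combine with hashing, the snippet below hashes every (publisher-side, advertiser-side) conjunction directly, without ever enumerating the outer product. `zlib.crc32`, the function names and the example values are assumptions made for illustration.

```python
import zlib
from itertools import product


def hashed_index(token, dim):
    """Deterministic index in [0, dim) for a string token."""
    return zlib.crc32(token.encode("utf-8")) % dim


def quadratic_features(publisher_side, advertiser_side, num_bits=20):
    """Hash each conjunction of one publisher-side and one advertiser-side feature."""
    dim = 1 << num_bits
    pairs = product(sorted(publisher_side.items()), sorted(advertiser_side.items()))
    indices = set()
    for (p_name, p_value), (a_name, a_value) in pairs:
        indices.add(hashed_index(f"{p_name}={p_value}&{a_name}={a_value}", dim))
    return indices


publisher = {"site": "finance.yahoo.com", "url": "finance.yahoo.com/news"}
advertiser = {"advertiser": "bank of america", "campaign": "c1"}
conjunctions = quadratic_features(publisher, advertiser)
```

A linear model on these indices has one weight per hashed conjunction, which is where the resemblance to a degree-2 polynomial kernel comes from.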

13 - ADVANTAGES OF HASHING

• Practical

– Straightforward to implement; no need to maintain dictionaries

• Statistical

– Regularization (infrequent values are washed away by frequent ones)

• Most powerful when combined with quadratic features

Quote of John Langford about hashing

At first it’s scary, then you love it

14 - LEARNING

• Regularized logistic regression

– Vowpal Wabbit open source package

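• Regularization with hierarchical features acts as backoff smoothing: the weight of a coarse feature is well estimated, while the weight of a fine-grained feature stays small if its value is rare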

• Negative data subsampled for computational reasons
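A small sketch of negative subsampling follows. The correction function is an assumption, not something stated on the slide: it is the standard odds rescaling for a model trained on data where each negative was kept with probability `keep_rate`. All names are hypothetical.

```python
import random


def subsample_negatives(examples, keep_rate, rng):
    """Keep every positive (label 1) and each negative (label 0) with probability keep_rate."""
    return [(x, y) for x, y in examples if y == 1 or rng.random() < keep_rate]


def correct_probability(p_sampled, keep_rate):
    """Map a probability learned on subsampled data back to the original scale.

    Dropping negatives multiplies the positive/negative odds by 1 / keep_rate,
    so the true odds are keep_rate times the odds seen by the model.
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)
```

For instance, a prediction of 0.5 from a model trained with `keep_rate=0.1` corresponds to true odds of 0.1, that is a probability of 1/11.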

15 - EVALUATION

• Comparison with (Agarwal et al. ’10)

– Probabilistic model for the same display advertising prediction problem

– Leverages the hierarchical structures on the ad and publisher sides

– Sparse prior for smoothing

• Model trained on three weeks of data, tested on the 3 following days

auROC: +3.1%

auPRC: +10.0%

Log likelihood: +7.1%

D. Agarwal et al., Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models, KDD, 2010

16 - BAYESIAN LOGISTIC REGRESSION

• Regularized logistic regression = MAP solution

(Gaussian prior, logistic likelihood)

• Posterior is not Gaussian

• Diagonal Laplace approximation: [equations: Gaussian approximation of the posterior around the MAP solution, with diagonal covariance]
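For reference, the usual form of a diagonal Laplace approximation for logistic regression is written out below. The notation (prior means m_i, prior precisions q_i, labels in {-1, +1}) is an assumption; the slide's own equations did not survive.

```latex
% Assumed notation: Gaussian prior w_i ~ N(m_i, 1/q_i), labels y_j in {-1, +1}.
\hat{w} = \arg\min_{w} \; \frac{1}{2} \sum_{i} q_i (w_i - m_i)^2
          + \sum_{j} \log\left(1 + \exp(-y_j\, w^\top x_j)\right)

% Diagonal Laplace approximation of the posterior around the MAP solution:
P(w_i \mid D) \approx \mathcal{N}\!\left(\hat{w}_i,\; \tilde{q}_i^{-1}\right),
\qquad
\tilde{q}_i = q_i + \sum_{j} x_{ij}^2\, p_j (1 - p_j),
\qquad
p_j = \frac{1}{1 + \exp(-\hat{w}^\top x_j)}
```

The updated precision is the prior precision plus the diagonal of the Hessian of the logistic loss evaluated at the MAP solution.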

17 - MODEL UPDATE

• Needed because ads / campaigns keep changing.

• The posterior distribution of a previously trained model can be used as

the prior for training a new model with a new batch of data

[Diagram: timeline from Day 1 to Day 5 with successive models M0, M1, M2, each trained from the previous one on a new batch of data.]

• Influence of the update frequency (auPRC):

1 day: +3.7%

6 hours: +5.1%

2 hours: +5.8%
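The update scheme can be sketched as follows, assuming a diagonal Gaussian posterior and plain gradient descent for the MAP step. This is a toy illustration in pure Python, not the production training procedure; `laplace_update` and its parameters are hypothetical names, and labels are taken in {0, 1}.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def laplace_update(mean, precision, batch, steps=500):
    """One model update: (mean, precision) is the Gaussian prior, batch is [(x, y)].

    Returns the diagonal Laplace posterior, usable as the prior for the next batch.
    """
    w = list(mean)
    # Step size 1/L, where L is an upper bound on the curvature of the objective.
    lipschitz = max(precision) + 0.25 * sum(sum(v * v for v in x) for x, _ in batch)
    for _ in range(steps):
        # Gradient of the negative log posterior: prior term plus logistic loss term.
        grad = [q * (wi - mi) for wi, mi, q in zip(w, mean, precision)]
        for x, y in batch:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        w = [wi - gi / lipschitz for wi, gi in zip(w, grad)]
    # New precision: prior precision plus the diagonal of the loss Hessian at w.
    new_precision = list(precision)
    for x, _ in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            new_precision[i] += xi * xi * p * (1.0 - p)
    return w, new_precision
```

Calling `laplace_update` on each new batch with the previous output as input gives the chain M0, M1, M2; the precision only grows, so weights of frequently observed features move less with each update.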

18 - OUTLINE

1. Criteo & Display advertising

2. Modeling

3. Large scale learning

4. Explore / exploit

19 - PARALLEL LEARNING

• Large training set

– 2B training samples; 16M parameters

– 400GB (compressed)

• Proposed method: less than one hour with 500 machines

• Optimize: [equation: regularized logistic regression objective]

• SGD is fast on a single machine, but difficult to parallelize.

• Batch (quasi-Newton) methods are straightforward to parallelize

– L-BFGS with distributed gradient computation.
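The reason batch methods parallelize easily is that the gradient is a sum over examples, so each node can compute the part for its own shard. A minimal single-process sketch, assuming labels in {-1, +1} and an L2 regularizer of strength `l2` (hypothetical names; the L-BFGS step itself is not shown):

```python
import math


def shard_gradient(w, shard):
    """Gradient of the logistic loss summed over one shard; labels y are in {-1, +1}."""
    grad = [0.0] * len(w)
    for x, y in shard:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coeff = -y / (1.0 + math.exp(margin))
        for i, xi in enumerate(x):
            grad[i] += coeff * xi
    return grad


def distributed_gradient(w, shards, l2):
    """Sum the per-shard gradients (the aggregation step) and add the L2 term."""
    partials = [shard_gradient(w, shard) for shard in shards]  # one per node
    total = [sum(column) for column in zip(*partials)]
    return [g + l2 * wi for g, wi in zip(total, w)]
```

The result is the same as the gradient computed on the whole data set, however the examples are split across shards.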


20 - ALLREDUCE

• Aggregate and broadcast across nodes

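[Figure: AllReduce on a tree of 7 nodes. Initial values: 9 at the root, 1 and 8 at the inner nodes, 7, 5, 3, 4 at the leaves. Values are summed up the tree (13 and 15 at the inner nodes, 37 at the root), then the total 37 is broadcast back to every node.]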

• Very few modifications to existing code: just insert a few AllReduce operations.

• Compatible with Hadoop / MapReduce

– Build a spanning tree on the gateway

– Single MapReduce job

– Leverage speculative execution to alleviate the slow node issue
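A single-process simulation of the aggregate-and-broadcast pattern, reusing the node values that survive from the slide's figure. The tree layout and the function name are assumptions; a real AllReduce exchanges messages between machines rather than recursing over a list.

```python
def allreduce_sum(values, children, root=0):
    """Simulate AllReduce on a tree: sum values up to the root, then broadcast the total."""
    def reduce_up(node):
        # Each node adds its own value to the sums received from its children.
        return values[node] + sum(reduce_up(child) for child in children.get(node, []))

    total = reduce_up(root)
    return [total for _ in values]  # every node ends up holding the same total


# Seven nodes arranged as a binary tree; node 0 is the root.
node_values = [9, 1, 8, 7, 5, 3, 4]
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
result = allreduce_sum(node_values, tree)
```

In the learning setting each value would be a node's partial gradient vector, and the broadcast total is the full gradient that every node needs for the next step.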

21 - ONLINE INITIALIZATION

• Hybrid approach:

– One pass of online learning on each node

– Average the weights from each node to get a warm start for batch optimization

• Best of both (online / batch) worlds.
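The hybrid scheme can be sketched as one SGD pass per shard followed by an average. The learning rate, the {0, 1} labels and the function names are assumptions for illustration; the batch optimizer that would start from the averaged weights is not shown.

```python
import math


def sgd_one_pass(shard, dim, lr=0.1):
    """One pass of online logistic regression over a node's shard (labels in {0, 1})."""
    w = [0.0] * dim
    for x, y in shard:
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w


def warm_start(shards, dim, lr=0.1):
    """Average the per-node weights to initialize the batch optimization."""
    per_node = [sgd_one_pass(shard, dim, lr) for shard in shards]
    return [sum(column) / len(per_node) for column in zip(*per_node)]
```

The averaged vector is only a starting point: the online pass gets close to a good solution cheaply, and the batch method then refines it.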

[Figure: two panels, one for splice site prediction (Sonnenburg et al. ’10) and one for display advertising.]

S. Sonnenburg and V. Franc, COFFIN: A Computational Framework for Linear SVMs, ICML 2010

22 - OUTLINE

1. Criteo & Display advertising

2. Modeling

3. Large scale learning

4. Explore / exploit

23 - THOMPSON SAMPLING

• Heuristic to address the Explore / Exploit problem, dating back to Thompson (1933)

• Simple to implement

• Good performance in practice (Graepel et al. ‘10, Chapelle and Li ‘11)

• Rarely used, maybe because of lack of theoretical guarantee.

1. Draw model parameter θ_t according to P(θ | D)

2. Select best action according to θ_t

3. Observe reward and update model
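The three steps can be sketched for the simplest case, a Bernoulli bandit with independent Beta posteriors per arm. The Beta(1, 1) prior and the function name are assumptions; the talk's setting uses a logistic regression posterior instead.

```python
import random


def thompson_bernoulli(true_means, horizon, seed=0):
    """Run Thompson sampling on a simulated Bernoulli bandit; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(horizon):
        # 1. Draw one parameter per arm from its Beta posterior (uniform prior).
        samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        # 2. Select the best action according to the drawn parameters.
        arm = max(range(n_arms), key=samples.__getitem__)
        # 3. Observe the reward and update the posterior of the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Exploration comes for free: an arm with few observations has a wide posterior, so it occasionally produces the largest sample and gets tried again.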

T. Graepel et al., Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine, ICML 2010

O. Chapelle and L. Li, An Empirical Evaluation of Thompson Sampling, NIPS 2011

24 - E/E SIMULATIONS

MAB with K arms. Best arm has mean reward = 0.5, others have 0.5 − ε.

25 - EVALUATION

• Semi-simulated environment: real input features, but labels generated.

• Set of eligible ads varies from 1 to 5,910. Total ads = 66,373

• Comparison of E/E algorithms:

– 4 days of data

– Cold start

• Algorithms:

– UCB: mean + std. dev.

– ε-greedy

– Thompson sampling

[Diagram: simulation loop. The learned w scores the X candidates and one X is selected; its label Y is generated from a random w (ground truth); the model is then updated.]
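The three selection rules can be compared side by side in a simplified form, assuming each candidate ad comes with a posterior mean and standard deviation for its score. The per-candidate Gaussian draw, the `alpha` multiplier and the function names are assumptions made for illustration.

```python
import random


def select_ucb(means, stds, alpha=1.0):
    """Pick the candidate with the highest mean + alpha * std. dev."""
    scores = [m + alpha * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)


def select_epsilon_greedy(means, epsilon, rng):
    """With probability epsilon pick a random candidate, otherwise the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=means.__getitem__)


def select_thompson(means, stds, rng):
    """Pick the candidate with the highest score sampled from its posterior."""
    samples = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return max(range(len(samples)), key=samples.__getitem__)
```

UCB and Thompson sampling both favor uncertain candidates, while ε-greedy explores uniformly regardless of how much is already known about each ad.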

26 - RESULTS

• CTR regret (in percentage):

Thompson: 3.72

UCB: 4.14

ε-greedy: 4.98

Exploit-only: 5.00

Random: 31.95

• Regret over time:

27 - OPEN QUESTIONS

• Hashing

– Theoretical performance guarantees

• Low rank matrix factorization

– Better predict on unseen pairs (publisher, advertiser)

• Sample selection bias

– System is trained only on selected ads, but all ads are scored.

– Possible solution: inverse propensity scoring

– But we still need to bias the training data toward good ads.

• Explore / exploit

– Evaluation framework

– Regret analysis of Thompson sampling

– E/E with a budget; with multiple slots; with a delayed feedback
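As an illustration of the inverse propensity scoring idea mentioned under sample selection bias, the sketch below weights each training example by the inverse of the probability with which its ad was selected. The data layout and names are hypothetical, and this is one common form of the technique rather than a method described in the talk.

```python
import math


def ips_weighted_log_loss(examples, predict):
    """Average log loss where each example is weighted by 1 / propensity.

    examples: list of (features, label, propensity) with label in {0, 1} and
    propensity the probability that the ad was selected (and hence observed).
    predict: function mapping features to a predicted click probability.
    """
    total = 0.0
    for features, label, propensity in examples:
        p = predict(features)
        loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
        # Rarely selected ads count more, to stand in for the ads never shown.
        total += loss / propensity
    return total / len(examples)
```

Small propensities produce large weights and therefore high variance, which is consistent with the caveat above about still biasing the training data toward good ads.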

28 - CONCLUSION

• Simple yet efficient techniques for click prediction

• Main difficulty in applied machine learning: avoid the bias (because of academic papers) toward complex systems

– It’s easy to get lured into building a complex system

– It’s difficult to keep it simple

See paper for more details: O. Chapelle, E. Manavoglu, R. Rosales, Simple and scalable response prediction for display advertising, 2014

29