This page reproduces the content of http://www.slideshare.net/tdunning/news-frommahout20130305 (uploaded 2013/03/05).


Presentation to the NYC HUG on March 5, 2013 regarding the upcoming Mahout release.


News From Mahout

©MapR Technologies - Confidential

1 - whoami – Ted Dunning

Chief Application Architect, MapR Technologies

Committer, member, Apache Software Foundation

– particularly Mahout, Zookeeper and Drill

(we’re hiring)

Contact me at

tdunning@maprtech.com

tdunning@apache.org

ted.dunning@gmail.com

@ted_dunning


2 - Slides and such (available late tonight):

– http://www.mapr.com/company/events/nyhug-03-05-2013

Hash tags: #mapr #nyhug #mahout


3 - New in Mahout

0.8 is coming soon (1-2 months)

gobs of fixes

QR decomposition is 10x faster

– makes ALS 2-3 times faster

May include Bayesian Bandits

Super fast k-means

– fast

– online (!?!)

Possible new edition of MiA coming

– Japanese and Korean editions released, Chinese coming

6 - Real-time Learning


7 - We have a product to sell … from a web-site

8 - What tag-line? What picture? What call to action?

Bogus Dog Food is the Best!

Now available in handy 1 ton bags!

Buy 5!


9 - The Challenge

Design decisions affect probability of success

– Cheesy web-sites don’t even sell cheese

The best designers do better when allowed to fail

– Exploration juices creativity

But failing is expensive

– If only because we could have succeeded

– But also because offending or disappointing customers is bad


10 - More Challenges

Too many designs

– 5 pictures

– 10 tag-lines

– 4 calls to action

– 3 background colors

=> 5 x 10 x 4 x 3 = 600 designs

It gets worse quickly

– What about changes on the back-end?

– Search engine variants?

– Checkout process variants?


11 - Example – AB testing in real-time

I have 15 versions of my landing page

Each visitor is assigned to a version

– Which version?

A conversion or sale or whatever can happen

– How long to wait?

Some versions of the landing page are horrible

– Don’t want to give them traffic


12 - A Quick Diversion

You see a coin

– What is the probability of heads?

– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again

I catch the coin and ask again

I look at the coin (and you don’t) and ask again

Why does the answer change?

– And did it ever have a single value?


13 - A Philosophical Conclusion

Probability as expressed by humans is subjective and depends on

information and experience


14 - I Dunno


15 - 5 heads out of 10 throws


16 - 2 heads out of 12 throws
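The posteriors pictured on these two slides follow from simple counting. A small Python sketch (mine, not from the talk) of the same computation, assuming a uniform prior over the heads-probability:

```python
from math import comb

def beta_pdf(p, heads, tails):
    """Posterior density for the coin's heads-probability after the
    given counts, starting from a uniform Beta(1, 1) prior."""
    a, b = heads + 1, tails + 1
    # 1/B(a, b) for integer a, b
    norm = (a + b - 1) * comb(a + b - 2, a - 1)
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def posterior_mean(heads, tails):
    # mean of the Beta(heads + 1, tails + 1) posterior
    return (heads + 1) / (heads + tails + 2)

m1 = posterior_mean(5, 5)     # 5 heads out of 10 throws
m2 = posterior_mean(2, 10)    # 2 heads out of 12 throws
```

The second posterior is not only centered lower; with 12 throws it is also narrower, which is exactly the "probability depends on information" point of the preceding slides.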


17 - So now you understand Bayesian probability


18 - Another Quick Diversion

Let’s play a shell game

This is a special shell game

It costs you nothing to play

The pea has constant probability of being under each shell

(trust me)

How do you find the best shell?

How do you find it while maximizing the number of wins?


19 - Pause for short con game


20 - Interim Thoughts

Can you identify winners or losers without trying them out?

Can you ever completely eliminate a shell with a bad streak?

Should you keep trying apparent losers?


21 - So now you understand multi-armed bandits


22 - Conclusions

Can you identify winners or losers without trying them out?

No

Can you ever completely eliminate a shell with a bad streak?

No

Should you keep trying apparent losers?

Yes, but at a decreasing rate


23 - Is there an optimum strategy?


24 - Bayesian Bandit

Compute distributions based on data so far

Sample p1, p2 and p3 from these distributions

Pick shell i where i = argmax_i p_i

Lemma 1: The probability of picking shell i will match the probability that it is the best shell

Lemma 2: This is as good as it gets
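The selection rule above fits in a few lines of Python. This is my own illustrative sketch with made-up win probabilities, not the Mahout code: each shell keeps win/loss counts, we draw one sample from each Beta posterior, and we play the arg-max (Thompson sampling).

```python
import random

def select(wins, losses):
    # draw p_i from each shell's Beta posterior and pick the largest draw
    samples = [random.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return samples.index(max(samples))

def play(true_p, steps, seed=0):
    random.seed(seed)
    wins = [0] * len(true_p)
    losses = [0] * len(true_p)
    for _ in range(steps):
        i = select(wins, losses)          # pick a shell
        if random.random() < true_p[i]:   # play it and observe the outcome
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses

# three shells with hidden win probabilities; the third is best
wins, losses = play([0.1, 0.5, 0.9], steps=2000)
```

Because losers still get sampled occasionally, no shell is ever eliminated outright, but its share of plays decays, which matches the conclusions slide.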


25 - And it works!

(Plot: regret versus n, for n from 0 to 1100, comparing ε-greedy with ε = 0.05 against the Bayesian Bandit with a Gamma-Normal prior)


26 - Video Demo


27 - The Code

Select an alternative

select = function(k) {
  # k is an n x 2 matrix of counts: k[i,1] = failures, k[i,2] = successes
  n = dim(k)[1]
  p0 = rep(0, length.out = n)
  for (i in 1:n) {
    # sample from the Beta posterior for alternative i
    p0[i] = rbeta(1, k[i,2] + 1, k[i,1] + 1)
  }
  return(which(p0 == max(p0)))
}

Select and learn

learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)            # pick an alternative by sampling
    j = test(i)              # run the trial; returns 1 (failure) or 2 (success)
    k[i,j] = k[i,j] + 1      # update the counts
  }
  return(k)
}

But we already know how to count!


28 - The Basic Idea

We can encode a distribution by sampling

Sampling allows unification of exploration and exploitation

Can be extended to more general response models


29 - The Original Problem

Bogus Dog Food is the Best!

Now available in handy 1 ton bags!

Buy 5!

(Landing page mock annotated with variables x1, x2 and x3)


30 - Response Function

p(win) = w(∑_i w_i x_i)

(Plot: logistic response curve, y rising from 0 to 1 as x runs from -6 to 6)


31 - Generalized Banditry

Suppose we have an infinite number of bandits

– suppose they are each labeled by two real numbers x and y in [0,1]

– also that expected payoff is a parameterized function of x and y

E[z] = f(x, y | θ)

– now assume a distribution for θ that we can learn online

Selection works by sampling θ, then computing f

Learning works by propagating updates back to θ

– If f is linear, this is very easy

– For special other kinds of f it isn’t too hard

We don’t just have to have two labels; we could have labels and context
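A hedged sketch of this generalized scheme in Python (illustrative only, not Mahout code): arms are feature vectors, f is a logistic link over a linear θ, selection samples θ from an approximate Gaussian posterior, and learning pushes θ by stochastic gradient. The priors, learning rate, and variance decay here are my own assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class LinearLogisticBandit:
    """Arms are feature vectors x; expected payoff is sigmoid(theta . x).
    Keep an independent Gaussian over each component of theta, sample it
    at selection time, and learn theta online by stochastic gradient."""
    def __init__(self, dim, lr=0.1):
        self.mean = [0.0] * dim
        self.var = [1.0] * dim
        self.lr = lr

    def select(self, arms):
        theta = [random.gauss(m, v ** 0.5) for m, v in zip(self.mean, self.var)]
        scores = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x in arms]
        return scores.index(max(scores))

    def update(self, x, reward):
        # gradient step on the Bernoulli log-likelihood, plus a slow
        # variance shrink so exploration fades as evidence accumulates
        p = sigmoid(sum(m * xi for m, xi in zip(self.mean, x)))
        for i, xi in enumerate(x):
            self.mean[i] += self.lr * (reward - p) * xi
            self.var[i] = max(0.01, self.var[i] * 0.999)

random.seed(2)
arms = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # (x, y) labels for three arms
true_theta = (2.0, 1.0)                        # hidden parameters
bandit = LinearLogisticBandit(dim=2)
pulls = [0, 0, 0]
for _ in range(3000):
    i = bandit.select(arms)
    pulls[i] += 1
    p_true = sigmoid(sum(t * xi for t, xi in zip(true_theta, arms[i])))
    bandit.update(arms[i], 1.0 if random.random() < p_true else 0.0)
```

Sampling θ rather than the per-arm payoffs is what makes the scheme work for infinitely many arms: one posterior over θ scores every arm at once.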


32 - Context Variables

Bogus Dog Food is the Best!

Now available in handy 1 ton bags!

Buy 5!

(Landing page mock annotated with bandit variables x1, x2, x3 and context variables user.geo, env.time, env.day_of_week, env.weekend)


33 - Caveats

Original Bayesian Bandit only requires real-time counts

Generalized Bandit may require access to long history for learning

– Pseudo online learning may be easier than true online

Bandit variables can include content, time of day, day of week

Context variables can include user id, user features

Bandit × context variables provide the real power


34 - You can do this yourself!


35 - Super-fast k-means Clustering


36 - Rationale


37 - What is Quality?

Robust clustering not a goal

– we don’t care if the same clustering is replicated

Generalization is critical

Agreement to “gold standard” is a non-issue


38 - An Example


39 - An Example


40 - Diagonalized Cluster Proximity


41 - Clusters as Distribution Surrogate


42 - Clusters as Distribution Surrogate


43 - Theory


44 - For Example

(Diagram: grouping these two clusters seriously hurts squared distance)


45 - Algorithms


46 - Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd's algorithm

The result is that these two clusters get glued together


47 - Ball k-means

Provably better for highly clusterable data

Tries to find an initial centroid in the “core” of each real cluster

Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency

for each of a very few iterations:

for each data point:

assign point to nearest cluster

recompute centroids using only points much closer than closest cluster
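The pseudo-code above might look like this in Python. This is a minimal sketch under my own assumptions: k-means++-style seeding stands in for "distance maximizing tendency", and a trim ratio of 0.5 stands in for "much closer"; it is not the Mahout implementation.

```python
import random

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def seed(points, k):
    """k-means++-style seeding: new centroids are drawn with probability
    proportional to squared distance from the ones chosen so far."""
    centroids = [random.choice(points)]
    while len(centroids) < k:
        d2 = [min(dist2(p, c) for c in centroids) for p in points]
        r, acc = random.uniform(0, sum(d2)), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
        else:
            centroids.append(points[-1])
    return centroids

def ball_kmeans(points, k, iterations=3, trim=0.5):
    centroids = seed(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                    # assign to nearest centroid
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[i].append(p)
        for i, cluster in enumerate(clusters):
            if not cluster:
                continue
            others = [c for j, c in enumerate(centroids) if j != i]
            # keep only points much closer to this centroid than to any
            # other centroid: the "ball" that excludes outliers
            ball = [p for p in cluster
                    if dist2(p, centroids[i]) <= trim ** 2 *
                       min(dist2(p, c) for c in others)] or cluster
            dim = len(ball[0])
            centroids[i] = tuple(sum(p[d] for p in ball) / len(ball)
                                 for d in range(dim))
    return centroids

random.seed(0)
data = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(100)] + \
       [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(100)]
centers = ball_kmeans(data, k=2)
```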


48 - Still Not a Win

Ball k-means is nearly guaranteed to work with k = 2

Probability of successful seeding drops exponentially with k

Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

But for big data, k gets large

50 - Surrogate Method

Start with sloppy clustering into lots of clusters

κ = k log n clusters

Use this sketch as a weighted surrogate for the data

Results are provably good for highly clusterable data


51 - Algorithm Costs

Surrogate methods

– fast, sloppy single pass clustering with κ = k log n

– fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point

– fast, in-memory, high-quality clustering of κ weighted centroids

O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality

O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality

– result is k high-quality centroids

• For many purposes, even the sloppy surrogate may suffice

53 - Algorithm Costs

How much faster for the sketch phase?

– take k = 2000, d = 10, n = 100,000

– k d log n = 2000 × 10 × 26 ≈ 500,000

– d (log k + log log n) = 10 × (11 + 5) = 160

– roughly 3,000 times faster is a bona fide big deal


55 - How It Works

For each point

– Find the approximately nearest centroid (distance = d)

– If (d > threshold) create a new centroid

– Else if (u < d/threshold), with u drawn uniformly from [0, 1], create a new centroid

– Else add the point to the nearest centroid

If the number of centroids exceeds κ ≈ C log N

– Recursively cluster the centroids with a higher threshold
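A toy rendering of this streaming pass in Python. It is my own sketch, not Mahout's StreamingKMeans: new centroids are spawned with probability that grows with distance to the nearest existing one, and the threshold growth factor, the exact nearest search, and the weight handling are all illustrative assumptions.

```python
import math
import random

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sketch(points, kappa, threshold=0.5):
    """One sloppy streaming pass: compress the points into at most
    kappa weighted centroids."""
    centroids = []                      # list of (position, weight)
    for p in points:
        if not centroids:
            centroids.append((p, 1))
            continue
        i = min(range(len(centroids)), key=lambda j: dist2(p, centroids[j][0]))
        d = math.sqrt(dist2(p, centroids[i][0]))
        if d > threshold or random.random() < d / threshold:
            centroids.append((p, 1))    # start a new centroid
        else:
            c, w = centroids[i]         # fold the point into the nearest centroid
            centroids[i] = (tuple((w * ci + pi) / (w + 1)
                                  for ci, pi in zip(c, p)), w + 1)
        if len(centroids) > kappa:
            # too many centroids: loosen the threshold and recursively
            # re-sketch the centroids themselves
            threshold *= 1.5
            expanded = [c for c, w in centroids for _ in range(w)]
            centroids = sketch(expanded, kappa, threshold)
    return centroids

random.seed(1)
data = [(random.gauss(c, 0.2), random.gauss(c, 0.2))
        for c in (0, 5, 10) for _ in range(200)]
cs = sketch(data, kappa=20)
```

The weights matter: the downstream ball k-means treats each surrogate centroid as that many original points, so the sketch preserves the total mass of the data.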


56 - Implementation


57 - But Wait, …

Finding nearest centroid is inner loop

This could take O( d κ ) per point and κ can be big

Happily, approximate nearest centroid works fine


58 - Projection Search

(Diagram: projecting the points onto a line gives a total ordering!)
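The projection-search idea can be sketched as follows (my own illustrative version, not Mahout's): project every point onto one random unit vector, keep them sorted by that scalar, and check exactly only the few whose projections bracket the query's. The window of 4 candidates on each side is an arbitrary choice here.

```python
import bisect
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class ProjectionSearch:
    """Approximate nearest-neighbour search via one random projection:
    the projection gives a total ordering, so a binary search finds a
    handful of candidates that are then checked exactly."""
    def __init__(self, points, dim, window=4):
        u = [random.gauss(0, 1) for _ in range(dim)]
        norm = sum(x * x for x in u) ** 0.5
        self.u = [x / norm for x in u]
        self.window = window
        self.items = sorted((dot(p, self.u), p) for p in points)
        self.keys = [key for key, _ in self.items]

    def nearest(self, q):
        i = bisect.bisect_left(self.keys, dot(q, self.u))
        lo = max(0, i - self.window)
        hi = min(len(self.items), i + self.window)
        candidates = (p for _, p in self.items[lo:hi])
        return min(candidates,
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)))

random.seed(3)
pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(1000)]
index = ProjectionSearch(pts, dim=2)
near = index.nearest((5.0, 5.0))
```

Each query touches O(log n) keys plus a constant number of exact distances, instead of the O(d κ) full scan the previous slide worries about; using several projections would tighten the approximation.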


59 - LSH Bit-match Versus Cosine

(Plot: cosine similarity, Y axis from -1 to 1, versus number of matching LSH bits, X axis from 0 to 64)


60 - Results


61 - Parallel Speedup?

(Plot: time per point in μs versus number of threads, comparing the non-threaded version, the threaded version, and perfect scaling)


62 - Quality

Ball k-means implementation appears significantly better than simple k-means

Streaming k-means + ball k-means appears to be about as good as ball k-means alone

All evaluations on 20 newsgroups with held-out data

Figure of merit is mean and median squared distance to nearest cluster


63 - Contact Me!

We’re hiring at MapR in US and Europe

MapR software available for research use

Get the code as part of Mahout trunk (or 0.8 very soon)

Contact me at tdunning@maprtech.com or @ted_dunning

Share news with @apachemahout
