A Terascale Learning Algorithm

Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford

... And the Vowpal Wabbit project.

Applying for a fellowship in 1997

Interviewer: So, what do you want to do?

John: I'd like to solve AI.

I: How?

J: I want to use parallel learning algorithms to create fantastic learning machines!

I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point. At that time, smarter learning algorithms always won. To win, we must master the best single-machine learning algorithms, then clearly beat them with a parallel approach.

Demonstration
Terascale Linear Learning ACDL11

Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?

2.1T sparse features
17B examples
16M parameters
1K nodes

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ we beat all possible single-machine linear learning algorithms. (Actually, we can do even better now.)
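For concreteness, a minimal sketch (not VW's code) of evaluating such a sparse linear predictor: each example stores only its nonzero (index, value) pairs, so prediction cost scales with the nonzeros rather than with the 16M weights.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Minimal sketch (not VW's actual code): a sparse example stores only its
    // nonzero (feature index, value) pairs, so computing f_w(x) = sum_i w_i * x_i
    // costs time proportional to the nonzeros, not to the 16M-dimensional weights.
    float predict(const std::vector<float>& weights,
                  const std::vector<std::pair<std::size_t, float>>& example) {
        float dot = 0.0f;
        for (const auto& f : example)            // f.first = index i, f.second = x_i
            dot += weights[f.first] * f.second;  // accumulate w_i * x_i
        return dot;
    }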


Compare: Other Supervised Algorithms in Parallel Learning book

[Chart "Speed per method": features/second on a log scale (100 to 1e+09) for parallel and single-machine learners, including linear learning on RCV1 and Ads, RBF-SVM on Ads and Synthetic data, ranking, boosted decision trees on MNIST 220K, a decision tree on Ad-Bounce, and an ensemble tree, with parallelism ranging from 2 threads through MPI-32/128/500 and TCP-48 to Hadoop+TCP-1000.]

The tricks we use

First Vowpal Wabbit      Newer Algorithmics        Parallel Stuff
Feature Caching          Adaptive Learning         Parameter Averaging
Feature Hashing          Importance Updates        Smart Averaging
Online Learning          Dimensional Correction    Gradient Summing
                         L-BFGS                    Hadoop AllReduce
                                                   Hybrid Learning

We'll discuss Hashing, AllReduce, then how to learn.

[Diagram: the conventional approach keeps a string → index dictionary in RAM in front of the weight array; VW hashes the string directly to a weight index.]

Most algorithms use a hashmap to change a word into an index for a weight. VW uses a hash function, which takes almost no RAM, is about 10x faster, and is easily parallelized. Empirically, radical state-compression tricks are possible with this.
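A minimal sketch of the idea, with std::hash standing in for VW's actual hash function; the class and member names are illustrative.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Minimal sketch of the hashing trick. A feature name is hashed straight to a
    // slot in a power-of-two sized weight array, so no string -> index dictionary
    // is held in RAM and no synchronization is needed to assign indices in parallel.
    struct HashedLinearModel {
        std::vector<float> weights;                      // 2^bits weights
        explicit HashedLinearModel(std::size_t bits)
            : weights(std::size_t(1) << bits, 0.0f) {}

        std::size_t index(const std::string& feature_name) const {
            // Masking works because the table size is a power of two.
            return std::hash<std::string>{}(feature_name) & (weights.size() - 1);
        }
        float& weight(const std::string& feature_name) {
            return weights[index(feature_name)];
        }
    };

Hash collisions are tolerated rather than resolved; the hashing references in the bibliography analyze why this costs little accuracy.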

MPI-style AllReduce

AllReduce = Reduce + Broadcast

[Diagram: seven nodes arranged as a binary tree, root 7 with children 5 and 6, whose children are the leaves 1, 2 and 3, 4. During the reduce phase partial sums move up the tree (5+1+2 = 8, 6+3+4 = 13, 7+8+13 = 28); during the broadcast phase the total 28 is pushed back down, so every node ends holding 28.]

Properties:
1. Easily pipelined, so no latency concerns.
2. Bandwidth ≤ 6n.
3. No need to rewrite code!
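To make Reduce + Broadcast concrete, here is a minimal single-process simulation (the real system passes partial sums over sockets between a node, its parent, and its children): the tree is stored in an array, values are summed toward the root, and the total is pushed back down.

    #include <cstddef>
    #include <vector>

    // Binary tree stored in an array: children of node i are 2i+1 and 2i+2.
    float reduce_sum(std::vector<float>& v, std::size_t i) {   // sum values up the tree
        if (i >= v.size()) return 0.0f;
        v[i] += reduce_sum(v, 2 * i + 1) + reduce_sum(v, 2 * i + 2);
        return v[i];
    }
    void broadcast(std::vector<float>& v, std::size_t i, float total) {  // push total down
        if (i >= v.size()) return;
        v[i] = total;
        broadcast(v, 2 * i + 1, total);
        broadcast(v, 2 * i + 2, total);
    }
    void allreduce_sum(std::vector<float>& node_values) {      // AllReduce = Reduce + Broadcast
        if (node_values.empty()) return;
        broadcast(node_values, 0, reduce_sum(node_values, 0));
    }
    // With node_values = {7, 5, 6, 1, 2, 3, 4}, every entry ends up equal to 28.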

An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max):
  1. While (examples left): do online update.
  2. AllReduce(weights)
  3. For each weight: w ← w/n

Other algorithms implemented:
1. Nonuniform averaging for online learning
2. Conjugate gradient
3. L-BFGS
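A minimal sketch of the weight-averaging pass above; the cross-node elementwise sum is supplied by the caller (in the real system this is the all_reduce call shown near the end of these slides), so this is not VW's literal code.

    #include <functional>
    #include <vector>

    // One synchronization step of weight averaging. `allreduce_sum` must replace
    // each entry of its argument by the sum of that entry across all nodes.
    void average_weights(std::vector<float>& weights,
                         const std::function<void(std::vector<float>&)>& allreduce_sum) {
        std::vector<float> one(1, 1.0f);
        allreduce_sum(one);                  // n = AllReduce(1) = number of nodes
        const float n = one[0];
        allreduce_sum(weights);              // AllReduce(weights): elementwise sum
        for (float& w : weights) w /= n;     // w <- w / n, i.e. the average
    }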

What is Hadoop-Compatible AllReduce?

1. "Map" job moves program to data.
2. Delayed initialization: most failures are disk failures. First read (and cache) all data before initializing AllReduce; failures autorestart on a different node with identical data.
3. Speculative execution: in a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers; we use the first to finish reading all data once.

The net effect: reliable execution out to perhaps 10K node-hours.


Normalize update so total step size is controlled. - Algorithms: Preliminaries

Optimize so few data passes required ⇒ Smart algorithms.

Basic problem with gradient descent = confused units.

fw (x) =

w

i

i xi

⇒ ∂(fw (x)−y)2 = 2(f

∂w

w (x ) − y )xi which has units of i .

i

But wi naturally has units of 1/i since doubling xi implies halving

wi to get the same prediction.

Crude fixes:

−1

1

Newton: Multiply inverse Hessian:

∂2

by gradient to

∂wi ∂wj

get update direction...but computational complexity kills you.

2

Normalize update so total step size is controlled...but this just

works globally rather than per dimension. - 2
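A quick check of the units argument above (plain algebra, not from the slides): rescale one feature by a constant and track what the weight and the gradient must do.

    % Rescale feature i by a constant c > 0: x_i -> c x_i.
    % To keep f_w(x) = \sum_j w_j x_j unchanged, the weight must shrink to w_i / c.
    \[
      \frac{\partial (f_w(x) - y)^2}{\partial w_i}
        = 2\,\bigl(f_w(x) - y\bigr)\, x_i
        \;\longmapsto\; 2\,\bigl(f_w(x) - y\bigr)\, c\, x_i ,
    \]
    % so the gradient scales like x_i while the correct weight scales like 1/x_i.
    % No single global learning rate is right under every rescaling, which is what
    % the adaptive, per-dimension updates on the next slide repair.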


Approach Used

1. Optimize hard so few data passes are required.
   1. L-BFGS = batch algorithm that builds up an approximate inverse Hessian according to Δ_w Δ_wᵀ / (Δ_wᵀ Δ_g), where Δ_w is a change in the weights w and Δ_g is a change in the loss gradient g.
   2. Dimensionally correct, adaptive, online gradient descent for a small number of passes.
      1. Online = update weights after seeing each example.
      2. Adaptive = learning rate of feature i proportional to 1/√(Σ g_i²), where the g_i are previous gradients of feature i.
      3. Dimensionally correct = still works if you double all feature values.
   3. Use (2) to warmstart (1).
2. Use map-only Hadoop for process control and error recovery.
3. Use custom AllReduce code to sync state.
4. Always save input examples in a cachefile to speed later passes.
5. Use the hashing trick to reduce input complexity.

Open source in Vowpal Wabbit 6.0. Search for it.
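A minimal sketch of the adaptive part of item (2): an AdaGrad-style per-feature learning rate of the form η/√(Σ g_i²). VW's actual update also includes importance-weight awareness and dimensional correction, which are omitted here.

    #include <cmath>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Adaptive online gradient descent for squared loss on sparse examples.
    // Each feature keeps the sum of its squared past gradients; its effective
    // step size is eta / sqrt(sum of squared gradients), so frequently-updated
    // features take smaller steps.
    struct AdaptiveSGD {
        std::vector<float> weights, grad_sq;
        float eta;
        AdaptiveSGD(std::size_t dim, float eta0)
            : weights(dim, 0.0f), grad_sq(dim, 1e-8f), eta(eta0) {}  // 1e-8 avoids divide-by-zero

        void update(const std::vector<std::pair<std::size_t, float>>& x, float y) {
            float pred = 0.0f;
            for (const auto& f : x) pred += weights[f.first] * f.second;
            const float dloss = 2.0f * (pred - y);        // d/dpred of (pred - y)^2
            for (const auto& f : x) {
                const float g = dloss * f.second;         // gradient for this feature
                grad_sq[f.first] += g * g;                // accumulate squared gradients
                weights[f.first] -= eta * g / std::sqrt(grad_sq[f.first]);
            }
        }
    };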

Robustness & Speedup

[Plot "Speed per method": speedup versus number of nodes (10 to 100), showing the Average_10, Min_10, and Max_10 curves against a linear-speedup reference; the speedup axis runs from 0 to 10.]

Webspam optimization

[Plot "Webspam test results": test log loss (0 to 0.5) versus number of passes (0 to 50) for Online, Warmstart L-BFGS, L-BFGS, and single-machine online learners.]

How does it work?

The launch sequence (on Hadoop):

1. mapscript.sh <hdfs output> <hdfs input> <maps asked>
2. mapscript.sh starts the spanning tree on the gateway. Each vw connects to the spanning tree to learn which other vw is adjacent in the binary tree.
3. mapscript.sh starts a Hadoop streaming map-only job: runvw.sh
4. runvw.sh calls vw twice.
5. The first vw call does online learning while saving examples in a cachefile. Weights are averaged before saving.
6. The second vw call uses the online solution to warmstart L-BFGS. Gradients are shared via AllReduce.

Example: mapscript.sh outdir indir 100

Everything also works in a non-Hadoop cluster: you just need to script more to do some of the things that Hadoop does for you.

How does this compare to other cluster learning efforts?

Mahout was founded as the MapReduce ML project, but has grown to include VW-style online linear learning. For linear learning, VW is superior:
1. Vastly faster.
2. Smarter algorithms.
Other algorithms exist in Mahout, but they are apparently buggy.

AllReduce seems a superior paradigm in the 10K node-hour regime for machine learning.

GraphLab (@CMU) doesn't overlap much; it is mostly about graphical model evaluation and learning.

Ultra LDA (@Y!) overlaps partially. VW has non-cluster-parallel LDA which is perhaps 3x more efficient.

Can allreduce be used by others?

It might be incorporated directly into next-generation Hadoop. But for now, it's quite easy to use the existing code.

void all_reduce(char* buffer, int n, std::string master_location, size_t unique_id, size_t total, size_t node);

buffer = pointer to some floats
n = number of bytes (4 * number of floats)
master_location = IP address of the gateway
unique_id = nonce (unique for different jobs)
total = total number of nodes
node = node id number

The call is stateful: it initializes the topology if necessary.
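A minimal usage sketch, assuming the summation-over-floats semantics described above; the wrapper name allreduce_sum and the idea of linking against VW's implementation are illustrative, not part of the slide.

    #include <cstddef>
    #include <string>
    #include <vector>

    // The interface from the slide above (implemented inside VW, not here).
    void all_reduce(char* buffer, int n, std::string master_location,
                    std::size_t unique_id, std::size_t total, std::size_t node);

    // Hypothetical helper: sum a float vector across all nodes in place.
    // `gateway` is the spanning-tree master's IP, `job_id` the shared nonce,
    // `total` the node count, and `node` this node's id.
    void allreduce_sum(std::vector<float>& values, const std::string& gateway,
                       std::size_t job_id, std::size_t total, std::size_t node) {
        all_reduce(reinterpret_cast<char*>(values.data()),
                   static_cast<int>(values.size() * sizeof(float)),  // n = bytes = 4 * #floats
                   gateway, job_id, total, node);
    }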

Further Pointers

search://Vowpal Wabbit
mailing list: vowpal_wabbit@yahoogroups.com
VW tutorial (pre-parallel): http://videolectures.net
Machine Learning (Theory) blog: http://hunch.net

Bibliography: Original VW

Caching: L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
Hashing: Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S.V.N. Vishwanathan, Hash Kernels for Structured Data, AISTATS 2009.
Hashing: K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.

Bibliography: Algorithmics

L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773–782, 1980.
Adaptive: H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
Adaptive: J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.
Importance: N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.

Bibliography: Parallel

Gradient summing: C. Teo, Q. Le, A. Smola, and S.V.N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
Averaging: G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker, Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.
Averaging: K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for Distributed Optimization, LCCC 2010.

(More forthcoming, of course.)