
- Distributed Perceptron

Introducing "Distributed Training Strategies for the Structured Perceptron" by R. McDonald, K. Hall & G. Mann, NAACL 2010

2010-10-06 / 2nd seminar for State-of-the-Art NLP

- Distributed training of perceptrons

in a theoretically proven way

Naive distribution strategy fails: parameter mixing (or averaging)

Simple modification: iterative parameter mixing

Proofs & experiments:
Convergence
Convergence speed
NER experiments
Dependency parsing experiments

- Timeline

1958 F. Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms

1962 H.D. Block and A.B. Novikoff (independently): the perceptron convergence theorem for the separable case

1999 Y. Freund & R.E. Schapire: voted perceptron, with a bound on the generalization error for the inseparable case

2002 M. Collins: generalization to the structured prediction problem

2010 R. McDonald et al.: parallelization with parameter mixing and synchronization

- A new strategy of parallelization is required for distributed perceptrons

Gradient-based batch training algorithms have been parallelized in the MapReduce style.

Parameter mixing works for maximum entropy models:
Divide the training data into a number of shards
Train separate models on the shards
Take the average of the models' weights

Perceptrons?
Non-convex objective function
Simple parameter mixing doesn't work

- Parameter mixing (averaging) fails (1/6)

Parameter mixing:
Train S perceptrons on S shards of the training data,
Take a weighted average of their weights
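As a minimal sketch (mine, not the authors' code), parameter mixing for a plain binary perceptron might look like the following; the uniform mixing weights and the `epochs` parameter are illustrative assumptions:

```python
import numpy as np

def train_perceptron(examples, dim, epochs=10):
    """Train a plain binary perceptron on one shard of (x, y) pairs, y in {-1, +1}."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            if y * (w @ x) <= 0:  # mistake (ties count as errors)
                w += y * x
    return w

def parameter_mix(shards, dim, mix_weights=None):
    """Train one perceptron per shard, then return the weighted average of the weights."""
    S = len(shards)
    mu = mix_weights if mix_weights is not None else [1.0 / S] * S
    models = [train_perceptron(shard, dim) for shard in shards]
    return sum(mu_i * w_i for mu_i, w_i in zip(mu, models))
```

Each shard is trained completely independently; communication happens only once, at the final averaging step. The next slides show why this single mixing step is not enough for perceptrons.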

Distributed Training Strategies for the Structured Perceptron, by R. McDonald, K. Hall & G. Mann, 2010

- Parameter mixing (averaging) fails (2/6)

Counter example

Feature space (separated into observed and non-observed examples):

f(x_{1,1}, 0) = [1 1 0 0 0 0]    f(x_{1,1}, 1) = [0 0 0 1 1 0]
f(x_{1,2}, 0) = [0 0 1 0 0 0]    f(x_{1,2}, 1) = [0 0 0 0 0 1]
f(x_{2,1}, 0) = [0 1 1 0 0 0]    f(x_{2,1}, 1) = [0 0 0 0 1 1]
f(x_{2,2}, 0) = [1 0 0 0 0 0]    f(x_{2,2}, 1) = [0 0 0 1 0 0]

Preview of the consequence:

Shard 1: (x_{1,1}, 0), (x_{1,2}, 1)
Shard 2: (x_{2,1}, 0), (x_{2,2}, 1)

Mixing of two local optima: smaller data can fool the algorithm, because of the increased role of initialization and tie-breaking.

- Parameter mixing (averaging) fails (3/6)

Counter example

Feature space:

f(x_{1,1}, 0) = [1 1 0 0 0 0]    f(x_{1,1}, 1) = [0 0 0 1 1 0]
f(x_{1,2}, 0) = [0 0 1 0 0 0]    f(x_{1,2}, 1) = [0 0 0 0 0 1]

Shard 1: (x_{1,1}, 0), (x_{1,2}, 1)

w_1 := [0 0 0 0 0 0]    {initialization}
w_1 · f(x_{1,1}, 0)^T ≦ w_1 · f(x_{1,1}, 1)^T    {tie broken toward 1: wrong, so update}
w_1 := [1 1 0 0 0 0] - [0 0 0 1 1 0] = [1 1 0 -1 -1 0]
w_1 · f(x_{1,2}, 0)^T ≦ w_1 · f(x_{1,2}, 1)^T    {tie broken toward 1: correct, no update}

- Parameter mixing (averaging) fails (4/6)

Counter example

Feature space:

f(x_{2,1}, 0) = [0 1 1 0 0 0]    f(x_{2,1}, 1) = [0 0 0 0 1 1]
f(x_{2,2}, 0) = [1 0 0 0 0 0]    f(x_{2,2}, 1) = [0 0 0 1 0 0]

Shard 2: (x_{2,1}, 0), (x_{2,2}, 1)

w_2 := [0 0 0 0 0 0]    {initialization}
w_2 · f(x_{2,1}, 0)^T ≦ w_2 · f(x_{2,1}, 1)^T    {tie broken toward 1: wrong, so update}
w_2 := [0 1 1 0 0 0] - [0 0 0 0 1 1] = [0 1 1 0 -1 -1]
w_2 · f(x_{2,2}, 0)^T ≦ w_2 · f(x_{2,2}, 1)^T    {tie broken toward 1: correct, no update}

- Parameter mixing (averaging) fails (5/6)

Counter example

Feature space:

f(x_{1,1}, 0) = [1 1 0 0 0 0]    f(x_{1,1}, 1) = [0 0 0 1 1 0]
f(x_{1,2}, 0) = [0 0 1 0 0 0]    f(x_{1,2}, 1) = [0 0 0 0 0 1]
f(x_{2,1}, 0) = [0 1 1 0 0 0]    f(x_{2,1}, 1) = [0 0 0 0 1 1]
f(x_{2,2}, 0) = [1 0 0 0 0 0]    f(x_{2,2}, 1) = [0 0 0 1 0 0]

Shard 1: (x_{1,1}, 0), (x_{1,2}, 1) ... w_1 = [1 1 0 -1 -1 0]
Shard 2: (x_{2,1}, 0), (x_{2,2}, 1) ... w_2 = [0 1 1 0 -1 -1]

Mixed weight: [μ_1  1  μ_2  -μ_1  -1  -μ_2]    (with μ_1 + μ_2 = 1)

- Parameter mixing (averaging) fails (6/6)

Counter example

Feature space, annotated with the scores under the mixed weight:

f(x_{1,1}, 0) = [1 1 0 0 0 0] ... μ_1+1    f(x_{1,1}, 1) = [0 0 0 1 1 0] ... -μ_1-1
f(x_{1,2}, 0) = [0 0 1 0 0 0] ... μ_2      f(x_{1,2}, 1) = [0 0 0 0 0 1] ... -μ_2
f(x_{2,1}, 0) = [0 1 1 0 0 0] ... μ_2+1    f(x_{2,1}, 1) = [0 0 0 0 1 1] ... -μ_2-1
f(x_{2,2}, 0) = [1 0 0 0 0 0] ... μ_1      f(x_{2,2}, 1) = [0 0 0 1 0 0] ... -μ_1

The mixed weight [μ_1 1 μ_2 -μ_1 -1 -μ_2] doesn't separate positives and negatives:

The LHS (y = 0) feature vectors always beat the RHS (y = 1) vectors, w·f(*, 0) ≧ w·f(*, 1), so the examples whose true label is 1 are misclassified.
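The counterexample can be replayed numerically. The sketch below (mine, not the authors' code) runs each shard's perceptron with ties broken toward label 1, mixes with μ_1 = μ_2 = 0.5, and counts the mixed weight's mistakes; the vector u = [-1 2 -1 1 -2 1] is the separating weight the slide mentions:

```python
import numpy as np

# Feature map f(x, y) for the four examples, keyed by (example id, label).
f = {
    ("x11", 0): np.array([1, 1, 0, 0, 0, 0]), ("x11", 1): np.array([0, 0, 0, 1, 1, 0]),
    ("x12", 0): np.array([0, 0, 1, 0, 0, 0]), ("x12", 1): np.array([0, 0, 0, 0, 0, 1]),
    ("x21", 0): np.array([0, 1, 1, 0, 0, 0]), ("x21", 1): np.array([0, 0, 0, 0, 1, 1]),
    ("x22", 0): np.array([1, 0, 0, 0, 0, 0]), ("x22", 1): np.array([0, 0, 0, 1, 0, 0]),
}
shard1 = [("x11", 0), ("x12", 1)]
shard2 = [("x21", 0), ("x22", 1)]

def predict(w, x):
    # Break ties toward label 1, as in the slides' {tie-breaking} steps.
    return 0 if w @ f[(x, 0)] > w @ f[(x, 1)] else 1

def one_shard_perceptron(shard, epochs=5):
    """Run the perceptron on a single shard until the pass budget is spent."""
    w = np.zeros(6)
    for _ in range(epochs):
        for x, y in shard:
            y_hat = predict(w, x)
            if y_hat != y:
                w = w + f[(x, y)] - f[(x, y_hat)]
    return w

w1 = one_shard_perceptron(shard1)   # local optimum [1 1 0 -1 -1 0]
w2 = one_shard_perceptron(shard2)   # local optimum [0 1 1 0 -1 -1]
w_mix = 0.5 * w1 + 0.5 * w2         # the mixed weight with mu_1 = mu_2 = 0.5

data = shard1 + shard2
mix_errors = sum(predict(w_mix, x) != y for x, y in data)  # 2: fails on x12 and x22

u = np.array([-1, 2, -1, 1, -2, 1])
u_separates = all((u @ f[(x, 0)] > u @ f[(x, 1)]) == (y == 0) for x, y in data)
```

Both shard-local weights perfectly fit their own shard, yet the mixture predicts 0 on everything, so it misclassifies both examples labeled 1, while u classifies all four correctly.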

But there is a separating weight vector: [-1 2 -1 1 -2 1]

- Iterative parameter mixing
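This slide's figure (the algorithm itself) did not survive extraction. As a sketch under simplifying assumptions (binary labels rather than the paper's structured case; a serial loop standing in for the parallel map step), iterative parameter mixing interleaves one epoch per shard with a mixing step:

```python
import numpy as np

def one_epoch_perceptron(w, shard):
    """OneEpochPerceptron: a single pass over one shard, starting from the mixed weight w.
    Returns the updated weights and the number of updates (errors) k."""
    w = w.copy()
    k = 0
    for x, y in shard:            # y in {-1, +1}, x a feature vector
        if y * (w @ x) <= 0:      # mistake
            w += y * x
            k += 1
    return w, k

def iterative_parameter_mixing(shards, dim, n_epochs=20):
    """Each epoch: run OneEpochPerceptron on every shard (in parallel in the paper,
    serially here), then mix the resulting weights and broadcast the mixture.
    Uniform mixing mu_i = 1/S is assumed; the paper also analyzes
    error-proportional mixing mu_{i,n} proportional to k_{i,n}."""
    S = len(shards)
    w = np.zeros(dim)
    for _ in range(n_epochs):
        results = [one_epoch_perceptron(w, shard) for shard in shards]
        if all(k == 0 for _, k in results):
            break                 # no shard made an error: converged
        w = sum(w_i for w_i, _ in results) / S
    return w
```

The key difference from one-shot parameter mixing is the synchronization after every epoch: each shard restarts from the shared mixed weight, which is what the following convergence proof exploits.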
- Convergence theorem of iterative parameter mixing (1/4)

Assumptions:

u: separating weight vector (a unit vector, |u| = 1)
γ: margin, γ ≦ u · (f(x_t, y_t) - f(x_t, y')) for all t and all y' ≠ y_t
R: max_{t,y'} |f(x_t, y_t) - f(x_t, y')|
k_{i,n}: the number of updates (errors) that occur in the n-th epoch of the i-th OneEpochPerceptron

- Convergence theorem of iterative parameter mixing (2/4)

Lower bound, via the number of errors in each epoch:

From the definition of the margin: γ ≦ u · (f(x_t, y_t) - f(x_t, y'))

By induction on N:  u · w^{(avg,N)} ≧ Σ_n Σ_i μ_{i,n} k_{i,n} γ

- Convergence theorem of iterative parameter mixing (3/4)

Upper bound, via the number of errors in each epoch:

From the definition: R ≧ |f(x_t, y_t) - f(x_t, y')|, where y' = argmax_y w · f(x_t, y)

By induction on N:  |w^{(avg,N)}|² ≦ Σ_n Σ_i μ_{i,n} k_{i,n} R²

- Convergence theorem of iterative parameter mixing (4/4)

Combining the two bounds (using |u| = 1, so (u · w)² ≦ |w|²):

|w^{(avg,N)}|² ≧ (u · w^{(avg,N)})² ≧ (Σ_n Σ_i μ_{i,n} k_{i,n} γ)² = (Σ_n Σ_i μ_{i,n} k_{i,n})² γ²

|w^{(avg,N)}|² ≦ (Σ_n Σ_i μ_{i,n} k_{i,n}) R²

⟹ (Σ_n Σ_i μ_{i,n} k_{i,n})² γ² ≦ (Σ_n Σ_i μ_{i,n} k_{i,n}) R²

⟹ (Σ_n Σ_i μ_{i,n} k_{i,n}) γ² ≦ R²

⟹ (Σ_n Σ_i μ_{i,n} k_{i,n}) ≦ R² / γ²
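As a concrete numeric instance of the final bound Σ_n Σ_i μ_{i,n} k_{i,n} ≦ R²/γ² (the numbers are illustrative, not from the slides):

```latex
\sum_{n}\sum_{i} \mu_{i,n}\, k_{i,n}
\;\le\; \frac{R^2}{\gamma^2}
\;=\; \frac{1^2}{0.1^2}
\;=\; 100
\qquad (R = 1,\ \gamma = 0.1)
```

So the weighted total number of updates across all epochs and shards is bounded by a constant that does not depend on N or on the number of shards S.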

- Convergence speed is predicted in two ways (1/2)

Theorem 3 implies:

When we take uniform weights for mixing, the number of errors is (in the worst case, when the equality holds) proportional to the number of shards, implying that we cannot benefit from the parallelization very much:

#(errors per epoch) can be multiplied by S
the time required for an epoch is reduced to 1/S
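In rough terms (my arithmetic, not the slides'), the two effects can cancel in the worst case: S times as many errors means roughly S times as many epochs to converge, while parallelism divides per-epoch wall time by S:

```latex
T_{\text{uniform}}
\;\approx\; \bigl(S \cdot N_{\text{serial}}\bigr) \cdot \frac{t_{\text{epoch}}}{S}
\;=\; N_{\text{serial}} \cdot t_{\text{epoch}}
```

That is, uniform mixing may yield no net speedup over serial training in the worst case.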

- Convergence speed is predicted in two ways (2/2)

Section 4.3:

When we take error-proportional weighting for mixing (μ_{i,n} ∝ k_{i,n}), the number of epochs N_dist is bounded via the inequality geometric mean ≦ arithmetic mean.

Worst case (when the equality holds):
The same number of epochs as the vanilla perceptron
Even in that case, each epoch is S times faster because of the parallelization

N_dist doesn't depend on the number of shards, implying that we can benefit well from parallelization.

- Experiments

Comparison:
Serial (All Data)
Serial (Sub Sampling): use only one shard
Parallel (Parameter Mix)
Parallel (Iterative Parameter Mix)

Settings:
Number of shards: 10
(see the paper for more details)

- NER experiments: faster & better, close to averaged perceptrons

(figure omitted)

Iterative mixing is faster and more accurate than serial (non-averaged case).
Iterative mixing is faster and similarly accurate to serial (averaged case).

- Dependency parsing experiments: similar improvements

(figure omitted)

- Different shard size: the more shards, the slower convergence

High parallelism leads to slower convergence (at a rate somewhere between the two predictions).

- Conclusions

Distributed training of the structured perceptron via simple parameter-mixing strategies:

Guaranteed to converge and to separate the data (if separable)
Results in fast and accurate classifiers
Trade-off between high parallelism and slow convergence
(+ applicable to online passive-aggressive algorithms)

- Presenter's comments

Parameter synchronization can be slow, especially when the feature space or the number of epochs is large.

Analysis of the generalization error (for the inseparable case)?

Relation to the voted perceptron?
Voted perceptron: weighting by survival time
Distributed perceptron: weighting by the number of updates

Relation to Bayes point machines?