This page reproduces the content of https://speakerdeck.com/tetsuok/introduction-to-lp-mert.

Uploaded about 5 years ago (2012/11/29), in the Learning category.

Slides presented at my lab about Introduction to LP-MERT (Galley & Quirk, EMNLP 2011), which is an exact search algorithm for minimum error rate training for statistical machine translation.

- Optimal Search for Minimum Error Rate Training

Michel Galley and Chris Quirk

EMNLP 2011

Presenter: Tetsuo Kiso

tetsuo-s@is.naist.jp

MT study group meeting November 22, 2012

Feel free to email me if these slides contain any mistakes!

Thursday, November 29, 12 - Introduction

Thursday, November 29, 12 - SMT system needs tuning

monolingual corpus parallel corpus

decoder

moses

language model translation model

common practice: parameters are tuned using MERT on a small development set

image url: http://wyofile.com/wp-content/uploads/2010/11/chiefjoe-haultruck.jpg

3

Thursday, November 29, 12 - MERT (Och, 2003)

• popular in SMT and speech recognition

• directly optimizes the evaluation metric

- e.g., BLEU (Papineni+, 2002), TER (Snover+, 2006)

• originally from speech recognition (as always)

• Typically, the search space is approximated by N-best lists

- Extensions: lattice MERT (Macherey+, 2008), hypergraph MERT (Kumar+, 2009)

4

Thursday, November 29, 12 - Challenges in MERT

• BLEU is highly non-convex and difficult to optimize directly.

• Och's line minimization efficiently searches the error surface when tuning a single parameter.

• However, it remains inexact when tuning several parameters simultaneously (the multi-dimensional case).

5

Thursday, November 29, 12 - Contributions

• In theory: develop an exact search algorithm over the parameter space in the multi-dimensional case (called LP-MERT).

• In practice: LP-MERT is often computationally expensive over thousands of sentences, so approximations are used:

- search only promising regions

- use beam search

6

Thursday, November 29, 12 - Outline

• Introduction

• Objective function of MERT

• LP-MERT

- exact case

- approximations

• Experiments

7

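Thursday, November 29, 12 - MERT: minimize error counts

$$\hat{w} = \arg\min_{w} \left\{ \sum_{s=1}^{S} E(r_s, \hat{e}(f_s; w)) \right\} = \arg\min_{w} \left\{ \sum_{s=1}^{S} \sum_{n=1}^{N} E(r_s, e_{s,n})\, \delta(e_{s,n}, \hat{e}(f_s; w)) \right\}$$

where

$$\hat{e}(f_s; w) = \arg\max_{n \in \{1, \ldots, N\}} w^{\top} h_{s,n}, \qquad w, h_{s,n} \in \mathbb{R}^{D}$$

Here E is the error (e.g., 1-BLEU), r_s is the reference, f_s is the source sentence, e_{s,n} is the n-th candidate translation, h_{s,n} is its feature vector, and δ(a, b) is 1 if a = b and 0 otherwise.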
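To make the objective concrete, here is a minimal sketch that evaluates the summed error of the 1-best candidates selected by a given weight vector. The function names and the toy feature vectors and losses are assumptions made up for illustration; they do not come from the slides or the paper.

```python
def dot(w, h):
    """Inner product w^T h for plain Python sequences."""
    return sum(wd * hd for wd, hd in zip(w, h))


def mert_objective(w, feats, losses):
    """Sum over sentences of the error of the candidate that w ranks first.

    feats[s][n]  is the feature vector h_{s,n}
    losses[s][n] is the sentence-level error E(r_s, e_{s,n})
    """
    total = 0.0
    for cand_feats, cand_losses in zip(feats, losses):
        # e_hat(f_s; w): index of the highest-scoring candidate
        n_hat = max(range(len(cand_feats)), key=lambda n: dot(w, cand_feats[n]))
        total += cand_losses[n_hat]
    return total


# Hypothetical toy data: two sentences, D = 2.
feats = [
    [(4.0, 3.0), (2.5, 2.5), (1.0, 4.0), (3.0, 0.5)],
    [(1.0, 3.0), (3.0, 1.0), (0.0, 0.0)],
]
losses = [
    [0.28, 0.21, 0.37, 0.31],
    [0.10, 0.50, 0.30],
]

print(mert_objective((1.0, 0.0), feats, losses))
print(mert_objective((0.5, 1.0), feats, losses))
```

The second weight vector selects a lower-error candidate for the second sentence, so its objective value is smaller. MERT searches for the weight vector that minimizes this quantity.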

Thursday, November 29, 12 - LP-MERT

Thursday, November 29, 12 - Start with the simple case: the tuning set contains only one sentence.

tuning set: f1

candidate space: e1, e2, ..., eN

The task is to minimize the task loss on a single N-best list.

10

Thursday, November 29, 12 - Think about a two-dimensional feature space

[Figure: the candidate space e1, e2, ..., eN is mapped by h(f, e, ~) into the feature space (axes: feature 1, feature 2), where ~ is a hidden variable. Each point, e.g. h2(e2), corresponds to the feature vector of one translation.]

NOTE: Henceforth, we simply write hi.

Thursday, November 29, 12 - By exhaustively enumerating the N translations, we can find the one which yields the minimal task loss.

[Figure: four feature vectors plotted against feature 1 and feature 2, labeled with their task loss (lower is better): h1: 0.28, h2: 0.21, h3: 0.37, h4: 0.31]

Assumption: the task loss can be computed at the sentence level.

Thursday, November 29, 12 - We have to answer the question: which feature vectors can be maximized by some linear model?

[Figure: the same four points, h1: 0.28, h2: 0.21, h3: 0.37, h4: 0.31; axes: feature 1, feature 2]

Thursday, November 29, 12 - This can be done by checking whether the feature vector lies inside the convex hull or not.

[Figure: convex hull of h1: 0.28, h2: 0.21, h3: 0.37, h4: 0.31; axes: feature 1, feature 2]

Interior points (such as h2 in the figure) cannot be maximized!

Thursday, November 29, 12 - Extreme points (the vertices of the convex hull) can be maximized!

[Figure: the same convex hull with its extreme points highlighted; axes: feature 1, feature 2]

Thursday, November 29, 12 - Our task: find the extreme point with the minimal task loss.

[Figure: convex hull with extreme points h1: 0.28, h3: 0.37, h4: 0.31; axes: feature 1, feature 2]

Note: the figure is only an illustration and may differ from the actual case.

Thursday, November 29, 12 - Let's compute the convex hull?

• We know the QuickHull algorithm (Eddy, 1977; Barber+, 1996).

• Once we construct the convex hull, we can find the extreme point with the lowest loss!

This does not scale! Computing the convex hull is very expensive!

Time complexity of the best known convex hull algorithm (Barber+, 1996): $O(N^{\lfloor D/2 \rfloor + 1})$
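The "build the hull, then pick the lowest-loss vertex" idea can be sketched in two dimensions as follows. This sketch uses Andrew's monotone chain algorithm rather than QuickHull, purely because it is short; the coordinates are hypothetical and were chosen only so that h2 falls inside the hull, as in the figures.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def extreme_points(points):
    """Indices of the hull vertices of a 2-D point set (monotone chain)."""
    order = sorted(range(len(points)), key=lambda i: points[i])

    def half_hull(indices):
        chain = []
        for i in indices:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(
                points[chain[-2]], points[chain[-1]], points[i]
            ) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    return set(half_hull(order)) | set(half_hull(reversed(order)))


# Hypothetical coordinates for h1..h4 with the task losses from the slides.
points = [(4.0, 3.0), (2.5, 2.5), (1.0, 4.0), (3.0, 0.5)]
losses = [0.28, 0.21, 0.37, 0.31]

hull = extreme_points(points)
best = min(hull, key=lambda i: losses[i])
print(sorted(hull), best)
```

With these coordinates the point with the lowest loss overall (index 1, i.e. h2) is not a hull vertex, so the best reachable candidate is index 0 (h1). The approach is easy in 2-D but, as the complexity above shows, it becomes impractical as D grows.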

Thursday, November 29, 12 - We want to find extreme points without computing any convex hull explicitly.

• We use linear programming (LP).

• A polynomial-time algorithm (Karmarkar, 1984) is known that meets this requirement.

Time complexity: $O(N D^{3.5}) \ll O(N^{\lfloor D/2 \rfloor + 1})$

Thursday, November 29, 12 - The Breakthrough

Interior-Point Methods—The Breakthrough

http://www.princeton.edu/~rvdb/542/lectures/lec14.pdf

19

Thursday, November 29, 12 - This algorithm tells us whether the given point hi is extreme. We don't explicitly construct the convex hull.

Karmarkar's algorithm returns a weight vector w (≠ 0) if hi is extreme, and returns 0 otherwise.

- hi is interior: the algorithm returns 0.

- hi is extreme: the algorithm returns the weight vector ŵ that maximizes hi.

[Figure axes: feature 1, feature 2]
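One way to realize such an extreme-point test is sketched below. It is an illustration under stated assumptions, not the exact formulation of the paper: it solves a max-margin LP (maximize t subject to w·(h_i - h_j) ≥ t for all j ≠ i, with each weight bounded in [-1, 1]) using SciPy's `linprog` with the HiGHS solver instead of Karmarkar's algorithm. The function name `lin_optimizer` and the toy points are hypothetical, and N ≥ 2 is assumed.

```python
import numpy as np
from scipy.optimize import linprog


def lin_optimizer(i, H, eps=1e-9):
    """Return w != 0 that strictly maximizes H[i], or the zero vector otherwise."""
    H = np.asarray(H, dtype=float)
    n, d = H.shape
    others = np.delete(np.arange(n), i)

    # Variables are (w_1, ..., w_D, t); linprog minimizes, so minimize -t.
    c = np.zeros(d + 1)
    c[-1] = -1.0

    # For every j != i:  w.(h_j - h_i) + t <= 0
    A_ub = np.hstack([H[others] - H[i], np.ones((n - 1, 1))])
    b_ub = np.zeros(n - 1)

    # Box constraints on w keep the LP bounded; t is free.
    bounds = [(-1.0, 1.0)] * d + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if res.status == 0 and res.x[-1] > eps:
        return res.x[:d]
    return np.zeros(d)


# Hypothetical 2-D feature vectors; index 1 lies inside the hull of the others.
H = [(4.0, 3.0), (2.5, 2.5), (1.0, 4.0), (3.0, 0.5)]

print(lin_optimizer(0, H))  # a non-zero weight vector
print(lin_optimizer(1, H))  # the zero vector
```

A positive margin t means some weight vector ranks h_i strictly above every other candidate, which is the condition the slide calls "extreme". Points that lie on the hull boundary without being strictly maximizable are treated as non-extreme by this sketch.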

Thursday, November 29, 12 - LP-MERT for a single sentence

• compute the task loss for the N-best translations

• sort the N-best list by increasing loss (in the figure: ① h2: 0.21, ② h1: 0.28, ③ h4: 0.31, ④ h3: 0.37)

• for each candidate, in that order:

- w ← run Karmarkar's algorithm

- if w ≠ 0, return w

• return 0

Done: the first non-zero w is returned.

Time complexity: $O(N^2 D^{3.5})$

[Figure axes: feature 1, feature 2]
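The loop above can be sketched as follows. The extreme-point oracle is passed in as a function; for a self-contained demonstration the sketch plugs in a crude 2-D stand-in that sweeps unit vectors around the circle, which is an assumption made here and not the LP used by LP-MERT. All names and coordinates are hypothetical.

```python
import math


def lp_mert_single(H, losses, lin_optimizer):
    """Visit candidates by increasing loss; return the first one that is extreme."""
    order = sorted(range(len(H)), key=lambda n: losses[n])
    for i in order:
        w = lin_optimizer(i, H)
        if any(x != 0 for x in w):
            return i, w
    return None, (0.0,) * len(H[0])


def sweep_optimizer(i, H, steps=3600):
    """2-D stand-in oracle: try evenly spaced directions instead of solving an LP."""
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        w = (math.cos(theta), math.sin(theta))
        score_i = w[0] * H[i][0] + w[1] * H[i][1]
        if all(
            score_i > w[0] * h[0] + w[1] * h[1]
            for j, h in enumerate(H)
            if j != i
        ):
            return w
    return (0.0, 0.0)


# Hypothetical coordinates for h1..h4 with the task losses from the slides.
H = [(4.0, 3.0), (2.5, 2.5), (1.0, 4.0), (3.0, 0.5)]
losses = [0.28, 0.21, 0.37, 0.31]

index, w = lp_mert_single(H, losses, sweep_optimizer)
print(index, w)
```

With this data the lowest-loss candidate (index 1, loss 0.21) is rejected because it is interior, and the search stops at index 0 (loss 0.28), mirroring the ① → ② order in the figure. The sketch returns the candidate index together with w only to make the result easy to inspect.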

- Empirical performance: Karmarkar (LP) vs. QuickHull

[Figure 2 (Galley and Quirk, 2011): running time in seconds (log scale, 0.001 to 100000) against the number of dimensions (2 to 20) for QuickHull, $O(N^{\lfloor D/2 \rfloor + 1})$, and LP, $O(N D^{3.5})$.]

Figure 2: Running times to exactly optimize N-best lists with an increasing number of dimensions. To determine which feature vectors were on the hull, we use either linear programming (Karmarkar, 1984) or one of the most efficient convex hull computation tools (Barber et al., 1996).

Text of the paper page shown on the slide (Galley and Quirk, 2011):

... above equations represent a linear program (LP), which can be turned into canonical form

$$\text{maximize } c^{\top} w \quad \text{subject to } A w \le b$$

by substituting y with $w^{\top} h_i$ in Eq. 3, by defining $A = \{a_{n,d}\}_{1 \le n \le N;\, 1 \le d \le D}$ with $a_{n,d} = h_{j,d} - h_{i,d}$ (where $h_{j,d}$ is the d-th element of $h_j$), and by setting $b = (0, \ldots, 0)^{\top} = 0$. The vertex $h_i$ is extreme if and only if the LP solver finds a non-zero vector w satisfying the canonical system. To ensure that w is zero only when $h_i$ is interior, we set $c = h_i - h_\mu$, where $h_\mu$ is a point known to be inside the hull (e.g., the centroid of the N-best list).[6] In the remaining of this section, we use this LP formulation in function LINOPTIMIZER($h_i$; $h_1 \ldots h_N$), which returns the weight vector $\hat{w}$ maximizing $h_i$, or which returns 0 if $h_i$ is interior to conv($h_1 \ldots h_N$). We also use conv($h_i$; $h_1 \ldots h_N$) to denote whether $h_i$ is extreme with respect to this hull.

Algorithm 1: LP-MERT (for S = 1).

input: sent.-level feature vectors H = {h_1 ... h_N}
input: sent.-level task losses E_1 ... E_N, where E_n := E(r_1, e_{1,n})
output: optimal weight vector ŵ
1 begin
      ▷ sort N-best list by increasing losses:
2     (i_1 ... i_N) ← INDEXSORT(E_1 ... E_N)
3     for n ← 1 to N do
          ▷ find ŵ maximizing the i_n-th element:
4         ŵ ← LINOPTIMIZER(h_{i_n}; H)
5         if ŵ ≠ 0 then
6             return ŵ
7     return 0

An exact search algorithm for optimizing a single N-best list is shown above. It lazily enumerates feature vectors in increasing order of task loss, keeping only the extreme ones. Such a vertex $h_j$ is known to be on the convex hull, and the returned vector $\hat{w}$ maximizes it. In Fig. 1, it would first run LINOPTIMIZER on $h_3$, discard it since it is interior, and finally accept the extreme point $h_1$. Each execution of LINOPTIMIZER requires $O(NT)$ time with the interior point method of (Karmarkar, 1984), and since the main loop may run $O(N)$ times in the worst case, time complexity is $O(N^2 T)$. Finally, Fig. 2 empirically demonstrates the effectiveness of a linear programming approach, which in practice is seldom affected by D.

[6] We assume that $h_1 \ldots h_N$ are not degenerate, i.e., that they collectively span $\mathbb{R}^D$. Otherwise, all points are necessarily on the hull, yet some of them may not be uniquely maximized.

3.1 Exact search: general case

We now extend LP-MERT to the general case, in which we are optimizing multiple sentences at once. This creates an intricate optimization problem, since the inner summations over n = 1 ... N in Eq. 1 can't be optimized independently. For instance, the optimal weight vector for sentence s = 1 may be suboptimal with respect to sentence s = 2. So we need some means to determine whether a selection $m = m(1) \ldots m(S) \in M = [1, N]^S$ of feature vectors $h_{1,m(1)} \ldots h_{S,m(S)}$ is extreme, that is, whether we can find a weight vector that maximizes each $h_{s,m(s)}$. Here is a reformulation of Eq. 1 that makes this condition on extremity more explicit:

$$\hat{m} = \arg\min_{m \in M:\ \mathrm{conv}(h[m]; H)} \left\{ \sum_{s=1}^{S} E(r_s, e_{s,m(s)}) \right\} \qquad (4)$$

where

$$h[m] = \sum_{s=1}^{S} h_{s,m(s)}, \qquad H = \bigcup_{m' \in M} h[m']$$

- Now, we will consider optimizing multiple sentences simultaneously.

Coming back to the real world (nightmares)

Thursday, November 29, 12 - Real case: need to deal with exponentially many combinations, $O(N^S)$

tuning set    candidate space
f1            e1,1  e2,1  ...  eS,1
f2            e1,2  e2,2  ...  eS,2
...           ...
fN            e1,N  e2,N  ...  eS,N

Typically, we use thousands of sentences (i.e., 1,000 < S < 10,000).

Thursday, November 29, 12 - Again, think about a two-dimensional feature space (for convenience)

[Figure: every candidate in the candidate space (e1,1 ... eS,N) is mapped by h(f, e, ~) to a point in the feature space, e.g. h2,2; axes: feature 1, feature 2. The feature space contains $O(N^S)$ points.]

Thursday, November 29, 12 - GOAL: We want to select a combination of S feature vectors so that the combination is extreme.

[Figure: feature space with $O(N^S)$ points; axes: feature 1, feature 2]

Thursday, November 29, 12 - Naïve approach: exhaustive search

• enumerate all possible combinations. Time: $O(N^S)$ (improvement: lazy enumeration, divide-and-conquer)

• for each combination, check whether it is extreme or not. Time: $O(N^S D^{3.5})$ per LP (improvement: sparse hypothesis combination, $O(N S D^{3.5})$)

Total: $O(N^{2S} D^{3.5})$

Hereafter, we present several improvements.
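The naïve search can be sketched for a toy problem as follows. Every combination of one candidate per sentence is enumerated, and a combination is accepted if a single weight vector ranks each selected candidate first in its own N-best list. The check uses a 2-D direction sweep as a stand-in for the LP; that substitution, the function names, and the data are assumptions for illustration only.

```python
import itertools
import math


def jointly_maximizable(m, nbest_feats, steps=3600):
    """Return a 2-D unit vector under which every selected candidate wins, else None."""
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        w = (math.cos(theta), math.sin(theta))

        def score(h):
            return w[0] * h[0] + w[1] * h[1]

        if all(
            all(score(feats[n]) > score(h) for j, h in enumerate(feats) if j != n)
            for feats, n in zip(nbest_feats, m)
        ):
            return w
    return None


def naive_exact_search(nbest_feats, nbest_losses):
    """Enumerate all N^S combinations and keep the extreme one with lowest total loss."""
    best_loss, best_m, best_w = math.inf, None, None
    for m in itertools.product(*(range(len(f)) for f in nbest_feats)):
        loss = sum(losses[n] for losses, n in zip(nbest_losses, m))
        if loss >= best_loss:
            continue  # cannot improve, skip the expensive check
        w = jointly_maximizable(m, nbest_feats)
        if w is not None:
            best_loss, best_m, best_w = loss, m, w
    return best_loss, best_m, best_w


# Hypothetical toy data: S = 2 sentences, D = 2 features.
nbest_feats = [
    [(4.0, 3.0), (2.5, 2.5), (1.0, 4.0), (3.0, 0.5)],
    [(1.0, 3.0), (3.0, 1.0), (0.0, 0.0)],
]
nbest_losses = [
    [0.28, 0.21, 0.37, 0.31],
    [0.10, 0.50, 0.30],
]

loss, m, w = naive_exact_search(nbest_feats, nbest_losses)
print(loss, m, w)
```

Here the combination with the lowest total loss overall, (1, 0), is rejected because candidate 1 of the first sentence is interior, so the search settles on (0, 0). The number of combinations grows as N^S, which is why this is only viable for toy inputs.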

- #1 Sparse hypothesis combination

Think about S = 2: given two N-best lists, we ask whether we can find a weight vector that maximizes both h1,1 and h2,1.

Geometrically, translate one N-best list so that h1,1 = h'2,1.

[Figure 3, panels (a) and (b) (Galley and Quirk, 2011): the two N-best lists, with h1,1 marked in (a) and h2,1 and h2,2 marked in (b)]

Figure 3: Given two N-best lists, (a) and (b), we use linear programming to determine which hypothesis combinations are extreme. For instance, the combination h1,1 and h2,1 is extreme (c), while h1,1 and h2,2 is not (d).

Text of the paper page shown on the slide (Galley and Quirk, 2011):

One naïve approach to address this optimization problem is to enumerate all possible combinations among the S distinct N-best lists, determine for each combination m whether h[m] is extreme, and return the extreme combination with lowest total loss. It is evident that this approach is optimal (since it follows directly from Eq. 4), but it is prohibitively slow since it processes $O(N^S)$ vertices to determine whether they are extreme, which thus requires $O(N^S T)$ time per LP optimization and $O(N^{2S} T)$ time in total. We now present several improvements to make this approach more practical.

3.1.1 Sparse hypothesis combination

In the naïve approach presented above, each LP computation to evaluate conv(h[m]; H) requires $O(N^S T)$ time since H contains $N^S$ vertices, but we show here how to reduce it to $O(NST)$ time. This improvement exploits the fact that we can eliminate the majority of the $N^S$ points of H, since only $S(N-1)+1$ are really needed to determine whether h[m] is extreme. This is best illustrated using an example, as shown in Fig. 3. Both h1,1 and h2,1 in (a) and (b) are extreme with respect to their own N-best list, and we ask whether we can find a weight vector that maximizes both h1,1 and h2,1. The algorithmic trick is to geometrically translate one of the two N-best lists so that h1,1 = h'2,1, where h'2,1 is the translation of h2,1. Then we use linear programming with the new set of $2N-1$ points, as shown in (c), to determine whether h1,1 is on the hull, in which case the answer to the original question is yes. In the case of the combination of h1,1 and h2,2, we see in (d) that the combined set of points prevents the maximization of h1,1, since this point is clearly no longer on the hull. Hence, the combination (h1,1, h2,2) cannot be maximized using any linear model. This trick generalizes to $S \ge 2$. In both (c) and (d), we used $S(N-1)+1$ points instead of $N^S$ to determine whether a given point is extreme. We show in the appendix that this simplification does not sacrifice optimality.

3.1.2 Lazy enumeration, divide-and-conquer

Now that we can determine whether a given combination is extreme, we must next enumerate candidate combinations to find the combination that has lowest task loss among all of those that are extreme. Since the number of feature vector combinations is $O(N^S)$, exhaustive enumeration is not a reasonable option. Instead, we use lazy enumeration to process combinations in increasing order of task loss, which ensures that the first extreme combination for s = 1 ... S that we encounter is the optimal one. An S-ary lazy enumeration would not be particularly efficient, since the runtime is still $O(N^S)$ in the worst case. LP-MERT instead uses divide-and-conquer and binary lazy enumeration, which enables us to discard early on combinations that are not extreme. For instance, if we find that (h1,1, h2,2) is interior for sentences s = 1, 2, the divide-and-conquer branch for s = 1 ... 4 never actually receives this bad combination from its left child, thus avoiding the cost of enumerating combinations that are known to be interior, e.g., (h1,1, h2,2, h3,1, h4,1).

The LP-MERT algorithm for the general case is shown as Algorithm 2. It basically only calls a recursive divide-and-conquer function (GETNEXTBEST) for sentence range 1 ... S. The latter function uses binary lazy enumeration in a manner similar to (Huang and Chiang, 2005), and relies on two global variables: I and L. The first of these, I, is used to memoize the results of calls to GETNEXTBEST; given a range of sentences and a rank n, it stores the nth best combination for that range of sentences. The global variable L stores hypotheses combination matrices, one matrix for each range of sentences (s, t) as shown in

Thursday, November 29, 12 - #1 Sparse hypothesis combination

Think about S = 2. Given two N-best lists, translate one of them so that h1,1 = h'2,1.

Then, we use LP with the new set of 2N-1 points to determine whether h1,1 is on the convex hull (i.e., check whether h1,1 is extreme).

[Figure 3(c): after the translation, h1,1 and h'2,1 coincide and the shared point is still on the hull.]

Answer: yes

Paper excerpt shown on the slide (Galley and Quirk, 2011, p. 42):

One naïve approach to address this optimization problem is to enumerate all possible combinations among the S distinct N-best lists, determine for each combination m whether h[m] is extreme, and return the extreme combination with lowest total loss. It is evident that this approach is optimal (since it follows directly from Eq. 4), but it is prohibitively slow since it processes O(N^S) vertices to determine whether they are extreme, which thus requires O(N^S T) time per LP optimization and O(N^{2S} T) time in total. We now present several improvements to make this approach more practical.

3.1.1 Sparse hypothesis combination

In the naïve approach presented above, each LP computation to evaluate conv(h[m]; H) requires O(N^S T) time since H contains N^S vertices, but we show here how to reduce it to O(NST) time. This improvement exploits the fact that we can eliminate the majority of the N^S points of H, since only S(N-1) + 1 are really needed to determine whether h[m] is extreme. This is best illustrated using an example, as shown in Fig. 3. Both h1,1 and h2,1 in (a) and (b) are extreme with respect to their own N-best list, and we ask whether we can find a weight vector that maximizes both h1,1 and h2,1. The algorithmic trick is to geometrically translate one of the two N-best lists so that h1,1 = h'2,1, where h'2,1 is the translation of h2,1. Then we use linear programming with the new set of 2N-1 points, as shown in (c), to determine whether h1,1 is on the hull, in which case the answer to the original question is yes. In the case of the combination of h1,1 and h2,2, we see in (d) that the combined set of points prevents the maximization of h1,1, since this point is clearly no longer on the hull. Hence, the combination (h1,1, h2,2) cannot be maximized using any linear model. This trick generalizes to S ≥ 2. In both (c) and (d), we used S(N-1) + 1 points instead of N^S to determine whether a given point is extreme. We show in the appendix that this simplification does not sacrifice optimality.

Figure 3: Given two N-best lists, (a) and (b), we use linear programming to determine which hypothesis combinations are extreme. For instance, the combination h1,1 and h2,1 is extreme (c), while h1,1 and h2,2 is not (d).

3.1.2 Lazy enumeration, divide-and-conquer

Now that we can determine whether a given combination is extreme, we must next enumerate candidate combinations to find the combination that has lowest task loss among all of those that are extreme. Since the number of feature vector combinations is O(N^S), exhaustive enumeration is not a reasonable option. Instead, we use lazy enumeration to process combinations in increasing order of task loss, which ensures that the first extreme combination for s = 1 ... S that we encounter is the optimal one. An S-ary lazy enumeration would not be particularly efficient, since the runtime is still O(N^S) in the worst case. LP-MERT instead uses divide-and-conquer and binary lazy enumeration, which enables us to discard early on combinations that are not extreme. For instance, if we find that (h1,1, h2,2) is interior for sentences s = 1, 2, the divide-and-conquer branch for s = 1 ... 4 never actually receives this bad combination from its left child, thus avoiding the cost of enumerating combinations that are known to be interior, e.g., (h1,1, h2,2, h3,1, h4,1).

The LP-MERT algorithm for the general case is shown as Algorithm 2. It basically only calls a recursive divide-and-conquer function (GETNEXTBEST) for sentence range 1 ... S. The latter function uses binary lazy enumeration in a manner similar to (Huang and Chiang, 2005), and relies on two global variables: I and L. The first of these, I, is used to memoize the results of calls to GETNEXTBEST; given a range of sentences and a rank n, it stores the nth best combination for that range of sentences. The global variable L stores hypotheses combination matrices, one matrix for each range of sentences (s, t) as shown in ...

(Galley and Quirk, 2011)

30

Thursday, November 29, 12 - How about this?

[Figure 3(a), (b): two N-best lists, with h1,1 marked in (a) and h2,1, h2,2 marked in (b).]

Is (h1,1, h2,2) extreme?

[Figure 3(d): the second list is translated so that h'2,2 coincides with h1,1; the shared point is no longer on the hull.]

Answer: No

[Slide background: the same page of Galley and Quirk (2011), Sections 3.1 to 3.1.2 and Figure 3.]

(Galley and Quirk, 2011)

31

Thursday, November 29, 12 - This trick generalizes to S ≥ 2

• We used S(N-1) + 1 points instead of N^S to

determine whether a given point is extreme.

• This trick does not sacrifice the optimality of

LP-MERT. See appendix for details.

32
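The sparse test from the last three slides can be sketched in a few lines. This is only a toy illustration under stated assumptions: feature vectors are 2-dimensional, and the LP is replaced by an exact 2D stand-in (all difference vectors must fit in an open half-plane). The function names and the toy data are hypothetical and not taken from the paper.

```python
import math

def difference_vectors(nbest_lists, choice):
    """Return the S(N-1) vectors h[s][j] - h[s][choice[s]] for j != choice[s].

    Translating every list so that the chosen points coincide yields these
    vectors plus the shared point itself, i.e. S(N-1) + 1 points in total.
    """
    diffs = []
    for hyps, m in zip(nbest_lists, choice):
        cx, cy = hyps[m]
        diffs.extend((x - cx, y - cy)
                     for j, (x, y) in enumerate(hyps) if j != m)
    return diffs

def is_extreme_2d(nbest_lists, choice):
    """2D stand-in for the LP (an assumption, not the paper's method).

    A weight vector w strictly preferring every chosen hypothesis exists
    iff all difference vectors lie in one open half-plane, i.e. iff the
    largest angular gap between them exceeds pi.
    """
    angles = sorted(math.atan2(dy, dx)
                    for dx, dy in difference_vectors(nbest_lists, choice)
                    if (dx, dy) != (0, 0))
    if not angles:
        return True
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(2 * math.pi - (angles[-1] - angles[0]))  # wrap-around gap
    return max(gaps) > math.pi

# Hypothetical toy data: two 3-best lists in a 2D feature space.
list_a = [(1.0, 4.0), (3.0, 3.0), (4.0, 1.0)]
list_b = [(0.0, 3.5), (2.0, 2.0), (3.0, 0.0)]
nbest = [list_a, list_b]

print(is_extreme_2d(nbest, (0, 0)))  # combination (h1,1, h2,1)
print(is_extreme_2d(nbest, (0, 1)))  # combination (h1,1, h2,2)
```

For this toy data the first call returns True and the second returns False, mirroring the yes/no answers on the two slides. In more than two dimensions the angular test no longer applies and an LP solver would be needed.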

Thursday, November 29, 12 - Naïve approach: exhaustive search

• enumerate all possible combinations. Time: O(N^S)

→ improved by lazy enumeration, divide-and-conquer

• for each combination, check whether it is

extreme or not. Time: O(N^S D^3.5) per LP

→ improved by sparse hypothesis combination: O(N S D^3.5) per LP

Total: O(N^{2S} D^3.5) → O(N^{S+1} S D^3.5)

33
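The two totals on this slide follow from multiplying the number of combinations by the cost of one LP. The derivation below is a reading of the slide, assuming that one LP over P points in D dimensions costs O(P D^3.5), and that the sparse set has S(N-1) + 1 = O(NS) points:

```latex
\begin{align*}
\text{na\"ive:} \quad
  & O(N^S) \times O(N^S D^{3.5}) = O(N^{2S} D^{3.5}) \\
\text{sparse:} \quad
  & O(N^S) \times O(N S D^{3.5}) = O(N^{S+1} S D^{3.5})
\end{align*}
```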

Thursday, November 29, 12 - Lazy enumeration

• Process combinations in increasing order of task loss

- For each N-best list, compute the task loss & sort all points

this ensures that the first extreme combination is optimal!

[Figure: two N-best lists plotted as points. Each point is labeled with its sentence-level task loss (0.42, 0.37, 0.53, 0.5, 0.28, 0.21, 0.35, 0.45, 0.31, 0.3) and numbered in increasing order of loss within its list (① to ⑥ and ① to ④). Annotation: "Sparse hypothesis combination".]

The case of S = 2

34

Thursday, November 29, 12 - Lazy enumeration

• Process combinations in increasing order of task loss: O(SN log N)

- For each N-best list, compute the task loss: O(NS)

- Sort all points: O(SN log N)

because we assumed that the task loss can be computed at

sentence-level.

N.B.: for a non-decomposable metric, O(N^S)

35
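For S = 2, the lazy enumeration on the last two slides can be sketched with a priority queue over the two sorted loss lists. This is a minimal sketch, not the authors' code: `is_extreme` is a hypothetical callback standing in for the LP test, and the split of the loss values between the two lists is assumed for illustration.

```python
import heapq

def lazy_pairs(losses_a, losses_b):
    """Yield (total_loss, i, j) in non-decreasing order of total loss.

    i and j are ranks into the two lists after sorting by task loss.
    Only the frontier of the (i, j) grid is kept in the heap.
    """
    a, b = sorted(losses_a), sorted(losses_b)
    heap = [(a[0] + b[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        total, i, j = heapq.heappop(heap)
        yield total, i, j
        # Expand the frontier: the two neighbours of the popped cell.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))

def first_extreme_pair(losses_a, losses_b, is_extreme):
    """Return the lowest-loss pair accepted by is_extreme, or None."""
    for total, i, j in lazy_pairs(losses_a, losses_b):
        if is_extreme(i, j):
            return total, i, j
    return None

# Assumed grouping of the loss values into two lists.
list_a = [0.21, 0.28, 0.37, 0.42, 0.5, 0.53]
list_b = [0.3, 0.31, 0.35, 0.45]

# Pretend the LP rejected the two cheapest pairs (hypothetical outcome).
interior = {(0, 0), (0, 1)}
best = first_extreme_pair(list_a, list_b, lambda i, j: (i, j) not in interior)
print(best)
```

Because pairs come out in non-decreasing order of total loss, the first pair that passes the extreme test is also the cheapest extreme pair, which is the property the slide highlights.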

Thursday, November 29, 12 - Binary lazy enumeration

• Divide and conquer: split the N-best lists into

smaller groups, compute the extreme combinations, and merge

two N-best lists.

similar to Algorithm 3 (Huang & Chiang, 2005)

• because S-ary lazy enumeration runs in

O(N^S)

36

Thursday, November 29, 12 - Binary lazy enumeration

The case of 4 sentences

{f1, f2, f3, f4}

{f1, f2}    {f3, f4}

f1    f2    f3    f4

h11   h21   h31   h41

NOTE: For each N-best list, the task loss is computed & all

points are sorted in increasing order.

37

Thursday, November 29, 12 - Binary lazy enumeration

The case of 4 sentences

{f1, f2, f3, f4}

{f1, f2}    {f3, f4}

f1: h11    f2: h21

compute the sum of the losses of h11 and h21: E11 + E21

        h21    h22    h23    h24
h11     69.1
h12

table stores the losses of

combinations

38

Thursday, November 29, 12 - Binary lazy enumeration

The case of 4 sentences

{f1, f2, f3, f4}

{f1, f2}   {f3, f4}

        h21     h22     h23     h24
h11     69.1
h12

69.1 = E11 + E21

combine h11 and h21 & check whether the combination is extreme

In this case, interior

39

Thursday, November 29, 12 - Binary lazy enumeration

The case of 4 sentences

{f1, f2, f3, f4}

{f1, f2}   {f3, f4}

        h21     h22
h11     69.1    69.2
h12     69.3

69.2 = E11 + E22 and 69.3 = E12 + E21 (frontier nodes)

enumerate the next best combination (ExpandFrontier)
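The frontier expansion on this slide can be sketched with a priority queue. This is only an illustration of lazily enumerating pairs from two loss-sorted lists, assuming the combined loss is the sum of the two losses as on the slide; the function name `lazy_combinations` and the individual loss values are hypothetical (they are given in tenths of a point and chosen so that the sums reproduce 69.1, 69.2 and 69.3).

```python
import heapq
from typing import Iterator, List, Tuple


def lazy_combinations(
    losses_a: List[int], losses_b: List[int]
) -> Iterator[Tuple[int, int, int]]:
    """Yield (combined loss, i, j) in increasing order of combined loss.

    Both inputs must already be sorted in increasing order. Only frontier
    cells (neighbours of cells already visited) are ever scored.
    """
    if not losses_a or not losses_b:
        return
    frontier = [(losses_a[0] + losses_b[0], 0, 0)]
    seen = {(0, 0)}
    while frontier:
        loss, i, j = heapq.heappop(frontier)  # choose the best frontier cell
        yield loss, i, j
        # Expand the frontier with the cells below and to the right.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(losses_a) and nj < len(losses_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (losses_a[ni] + losses_b[nj], ni, nj))


# Hypothetical losses in tenths of a point.
E1 = [300, 302]  # losses of h11, h12
E2 = [391, 392]  # losses of h21, h22
order = list(lazy_combinations(E1, E2))
```

Here the cell (h11, h21) is visited first, then (h11, h22), then (h12, h21), matching the order 69.1, 69.2, 69.3 in the table. In the algorithm described on the slides, each visited combination would additionally be checked for extremity before being returned.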

40

Thursday, November 29, 12 - Binary lazy enumeration

The case of 4 sentences

{f1, f2, f3, f4}

{f1, f2}   {f3, f4}

        h21     h22
h11     69.1    69.2
h12     69.3

choose the best frontier: (h11, h22), 69.2

combine & check whether the combination is extreme

In this case, extreme

41
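The extremity check can be illustrated in two dimensions. The sketch below is not the linear program used by LP-MERT: it swaps in a plain 2-D convex hull (Andrew's monotone chain), which answers the same question in this toy setting because a point is extreme exactly when it is a vertex of the convex hull. It assumes the feature vector of a combination is the sum of the feature vectors of its parts; all vectors and function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def cross(o: Point, a: Point, b: Point) -> int:
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_vertices(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; collinear boundary points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def combine(list_a: List[Point], list_b: List[Point]) -> List[Point]:
    """Feature vectors of all pairwise combinations (vector sums)."""
    return [(a[0] + b[0], a[1] + b[1]) for a in list_a for b in list_b]


def is_extreme(candidate: Point, points: List[Point]) -> bool:
    return candidate in convex_hull_vertices(points)


# Hypothetical 2-D feature vectors for the N-best lists of f1 and f2.
H1 = [(0, 0), (4, 0), (2, 4), (2, 1)]
H2 = [(0, 0), (2, 2)]
combined = combine(H1, H2)
```

In this toy data the combination (0, 0) + (2, 2) = (2, 2) is interior even though both of its parts are extreme in their own lists, which is why each combination has to be checked again after combining.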

Thursday, November 29, 12 - Binary lazy enumeration

The case of 4 sentences

{f1, f2, f3, f4}

{f1, f2} → {h11, h22}

{f3, f4} → {h31, h41}

table for {f3, f4}:

        h41     h42
h31
h32

42

Thursday, November 29, 12 - Binary lazy enumeration

{f1, f2, f3, f4}

{f1, f2} → {h11, h22}

{f3, f4} → {h31, h41}

table for {f1, f2, f3, f4}:

              {h31, h41}
{h11, h22}    126

if this combination is extreme, we’re done.

43

Thursday, November 29, 12 - Binary lazy enumeration

if this combination is interior, we have to search lower branches

table for {f1, f2, f3, f4}:

              {h31, h41}
{h11, h22}    126

table for {f1, f2}:

        h21     h22
h11     69.1    69.2
h12     69.3

table for {f3, f4}:

        h41     h42
h31
h32

44

Thursday, November 29, 12 - Binary lazy enumeration (S = 8)

Split

{f1, f2, f3, f4, f5, f6, f7, f8}

{f1, f2, f3, f4}   {f5, f6, f7, f8}

{f1, f2}   {f3, f4}   {f5, f6}   {f7, f8}

f1   f2   f3   f4   f5   f6   f7   f8

divide-and-conquer computations can be executed concurrently (not the same as parallelism! cf. http://vimeo.com/49718712)
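The split shown above can be sketched as a recursive halving of the sentence list. This is a minimal illustration of the tree shape only; the function name `split_tree` is hypothetical, and the combination and extremity checks performed at each node are omitted.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def split_tree(sentences: List[str]) -> Tree:
    """Recursively halve a non-empty list of sentence ids into a binary tree."""
    if not sentences:
        raise ValueError("need at least one sentence")
    if len(sentences) == 1:
        return sentences[0]
    mid = len(sentences) // 2
    # The two halves do not depend on each other, so they could be
    # processed concurrently.
    return (split_tree(sentences[:mid]), split_tree(sentences[mid:]))


tree = split_tree([f"f{i}" for i in range(1, 9)])
```

For S = 8 this yields the three levels of the slide: two groups of four, four groups of two, and eight leaves.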

45

Thursday, November 29, 12 - Approximations

• In practice, LP-MERT is computationally expensive

- there are too many LPs to be solved.

• Approximation #1: cos(w, w0) >= t

- w0: a reasonable approximation of ŵ obtained by 1D-MERT

• Approximation #2

- prune the output of the combined N-best lists by model score w.r.t. wbest (best model on the entire tuning set)
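The condition in Approximation #1 can be written out as a small check. This sketch only evaluates cos(w, w0) >= t for a given weight vector; it does not show how the constraint is built into the search itself. The function names and example vectors are hypothetical, and t = 0.84 is used only because it is one of the thresholds mentioned in the experiments.

```python
import math
from typing import Sequence


def cosine(w: Sequence[float], w0: Sequence[float]) -> float:
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(w, w0))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w0))
    if norm == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


def within_region(w: Sequence[float], w0: Sequence[float], t: float) -> bool:
    """True if w lies in the region cos(w, w0) >= t around w0."""
    return cosine(w, w0) >= t


w0 = (1.0, 0.0)  # stands in for the 1D-MERT solution
same_direction = within_region((2.0, 0.0), w0, 0.84)
orthogonal = within_region((0.0, 1.0), w0, 0.84)
```

A threshold close to 1 keeps only weight vectors that point almost in the same direction as w0, so fewer candidates remain to be checked.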

46

Thursday, November 29, 12 - Experiments

on machine translation

Thursday, November 29, 12 - Objectives

• compare LP-MERT with Och’s MERT (1D-MERT)

• evaluate exact LP-MERT

- effect of number of features on runtime

- scalability with the number of sentences

• evaluate LP-MERT w/ beam search

- BLEU scores for dev set

- end-to-end MERT comparison

48

Thursday, November 29, 12 - Setup

• system: dependency treelet system (Quirk+, 2005)

- w/ cube pruning

• task: WMT 2010 English-to-German

- dev set: WMT 2009 test set

- test set: WMT 2010 test set

• features: 13 (use two language models)

• N-best list size is 100

• evaluation: sentence BLEU (Lin & Och, 2004)

49

Thursday, November 29, 12 - Setup (contd.)

• Baseline: Och’s MERT (“1D-MERT”)

- use random restarts and random walks (Moore &

Quirk, 2008)

- the values of the parameters they used in experiments

are not clear from this paper.

50

Thursday, November 29, 12 - Additional notes

“To enable legitimate comparisons, LP-

MERT and 1D-MERT are evaluated on the

same combined N-best lists, even though

running multiple iterations of MERT with

either LP-MERT or 1D-MERT would normally

produce different combined N-best lists.”

-(Galley and Quirk, 2011)

51

Thursday, November 29, 12 - LP-MERT vs. 1D-MERT

1000 overlapping subsets for S = 2, 4, and 8.

experimented on combined N-best lists (374 translations per sentence on average) after 5 iterations of MERT

[Figure 5 (Galley and Quirk, 2011): Line graph of sorted differences in BLEUn4r1[%] scores between LP-MERT and 1D-MERT on 1000 tuning sets of size S = 2, 4, 8. The highest differences for S = 2, 4, 8 are respectively 23.3, 19.7, 13.1. y-axis: ΔBLEU[%], positive where LP-MERT is better and negative where 1D-MERT is better; x-axis: the 1000 tuning sets.]

NOTE: x-coordinate is sorted by ΔBLEU

“... German translation system. This consists of approximately 1.6 million parallel sentences, along with a much larger monolingual set of monolingual data. We train two language models, one on the target side of the training data (primarily parliamentary data), and the other on the provided monolingual data (primarily news). The 2009 test set is used as development data for MERT, and the 2010 one is used as test data. The resulting system has 13 distinct features.

5 Results

The section evaluates both the exact and beam version of LP-MERT. Unless mentioned otherwise, the number of features is D = 13 and the N-best list size is 100. Translation performance is measured with a sentence-level version of BLEU-4 (Lin and Och, 2004), using one reference translation. To enable legitimate comparisons, LP-MERT and 1D-MERT are evaluated on the same combined N-best lists, even though running multiple iterations of MERT with either LP-MERT or 1D-MERT would normally produce different combined N-best lists. We use WMT09 as tuning set, and WMT10 as test set. Before turning to large tuning sets, we first evaluate exact LP-MERT on data sizes that it can easily handle. Fig. 5 offers a comparison with 1D-MERT, for which we split the tuning set into 1,000 overlapping subsets for S = 2, 4, 8 on a combined N-best after five iterations of MERT with an average of 374 translation per sentence. The figure shows that LP-MERT never underperforms 1D-MERT in any of the 3,000 experiments, and this almost certainly confirms that LP-MERT systematically finds the global optimum. In the case S = 1, Powell rarely makes search errors (about 15%), but the situation gets worse as S increases. For S = 4, it makes search errors in 90% of the cases, despite using 20 random starting points.

Some combination statistics for S up to 8 are shown in Tab. 1. The table shows the speedup provided by LP-MERT is very substantial when compared to exhaustive enumeration. Note that this is using D = 13, and that pruning is much more effective with less features, a fact that is confirmed in Fig. 6. D = 13 makes it hard to use a large tuning set, but the situation improves with D = 2 . . . 5.

Fig. 7 displays execution times when LP-MERT constrains the output ŵ to satisfy cos(w0, ŵ) ≥ t, where t is on the x-axis of the figure. The figure shows that we can scale to 1000 sentences when (exactly) searching within the region defined by cos(w0, ŵ) ≥ .84. All these running times would improve using parallel computing, since divide-and-conquer algorithms are generally easy to parallelize. We also evaluate the beam version of LP-MERT, which allows us to exploit tuning sets of reasonable ...”

-(Galley and Quirk, 2011)

52
Thursday, November 29, 12 - Lazy vs. exhaustive enumeration

how many combinations are checked?

length   tested comb. (lazy)   total comb. (full)   order
8        639,960               1.33 × 10^20         O(N^8)
4        134,454               2.31 × 10^10         O(2N^4)
2        49,969                430,336              O(4N^2)
1        1,059                 2,624                O(8N)

Table 1: Number of tested combinations for the experiments of Fig. 5. LP-MERT with S = 8 checks only 600K full combinations on average, much less than the total number of combinations (which is more than 10^20). (Galley and Quirk, 2011)

N=374?

53
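The question "N=374?" can be checked with a little arithmetic on Table 1. The sketch assumes the "order" column gives the total number of combinations when 8 sentences are split into groups of the given length, i.e. (8 / length) × N^length; this reading and the function name `total_combinations` are assumptions, not statements from the paper.

```python
def total_combinations(n: int, group_size: int, num_sentences: int = 8) -> int:
    """Total combinations for num_sentences split into groups of group_size.

    Assumed reading of the "order" column: (8 / S) * N**S.
    """
    groups = num_sentences // group_size
    return groups * n ** group_size


# The two exact totals reported in Table 1 (lengths 1 and 2).
reported = {1: 2624, 2: 430336}

# Solve 8 * N = 2,624 for N, then check the result against the length-2 row.
n_implied = reported[1] // 8
consistent = total_combinations(n_implied, 2) == reported[2]
```

Under this reading the reported totals are reproduced by N = 328 (the rows for lengths 4 and 8 also agree to the printed precision), whereas N = 374 would give 8 × 374 = 2,992 for length 1. This suggests that the N behind the "total comb." column is not the 374 average quoted for the combined N-best lists.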

Thursday, November 29, 12 - Effect of # of dim.

[Figure 6 (Galley and Quirk, 2011): Effect of the number of features (runtime on 1 CPU of a modern computer). Each curve represents a different number of tuning sentences. x-axis: dimension (D); y-axis: runtime on 1 CPU in seconds (log scale, 1 to 10,000); curves are labeled with the number of tuning sentences (up to 1024).]

54
Thursday, November 29, 12 - Empirical performance of Approximation #1

larger value of cosine (more aggressive pruning), faster search

[Figure 7 (Galley and Quirk, 2011): Effect of a constraint on w, cos(w0, w) ≥ t (runtime on 1 CPU). x-axis: cosine threshold t (0.99, 0.98, 0.96, 0.92, 0.84, 0.68, 0.36, -0.28, -1); y-axis: runtime on 1 CPU in seconds (log scale, 1 to 10,000); curves are labeled with the number of tuning sentences (up to 1024).]

“... size. Results are displayed in Table 2. The gains are fairly substantial, with gains of 0.5 BLEU point or more in all cases where S ≤ 512 [footnote 8]. Finally, we perform an end-to-end MERT comparison, where both our algorithm and 1D-MERT are iteratively used to generate weights that in turn yield new N-best lists. Tuning on 1024 sentences of WMT10, LP-MERT converges after seven iterations, with a BLEU score of 16.21%; 1D-MERT converges after nine iterations, with a BLEU score of 15.97%. Test set performance on the full WMT10 test set for LP-MERT and 1D-MERT are respectively 17.08% and 16.91%.

6 Related Work

One-dimensional MERT has been very influential. It is now used in a broad range of systems, and has been improved in a number of ways. For instance, lattices or hypergraphs may be used in place of N-best lists to form a more comprehensive view of the search space with fewer decoding runs (Macherey et al., 2008; Kumar et al., 2009; Chatterjee and Cancedda, 2010). This particular refinement is orthogonal to our approach, though. We expect to extend LP-MERT to hypergraphs in future work. Exact search may be challenging due to the computational complexity of the search space (Leusch et al., 2008), but approximate search should be feasible.

Other research has explored alternate methods of gradient-free optimization, such as the downhill-simplex algorithm (Nelder and Mead, 1965; Zens et al., 2007; Zhao and Chen, 2009). Although the search space is different than that of Och’s algorithm, it still relies on one-dimensional line searches to reflect, expand, or contract the simplex. Therefore, it suffers the same problems of one-dimensional MERT: feature sets with complex non-linear interactions are difficult to optimize. LP-MERT improves on these methods by searching over a larger subspace of parameter combinations, not just those on a single line.

We can also change the objective function in a number of ways to make it more amenable to optimization, leveraging knowledge from elsewhere in the machine learning community. Instance reweighting as in boosting may lead to better parameter inference (Duh and Kirchhoff, 2008). Smoothing the objective function may allow differentiation and standard ML learning techniques (Och and Ney, 2002). Smith and Eisner (2006) use a smoothed objective along with deterministic annealing in hopes of finding good directions and climbing past locally optimal points. Other papers use margin methods such as MIRA (Watanabe et al., 2007; Chiang et al., 2008), updated somewhat to match the MT domain, to perform incremental training of potentially large numbers of features. However, in each of these cases the objective function used for training no longer matches the final evaluation metric.

7 Conclusions

Our primary contribution is the first known exact search algorithm for direct loss minimization on N-best lists in multiple dimensions. Additionally, we present approximations that consistently outperform standard one-dimensional MERT on a competitive machine translation system. While Och’s method of MERT is generally quite successful, there are cases where it does quite poorly. A more global search such as LP-MERT lowers the expected risk of such poor solutions. This is especially important for current machine translation systems that rely heavily on MERT, but may also be valuable for other textual ap- ...

Footnote 8: One interesting observation is that the performance of 1D-MERT degrades as S grows from 2 to 8 (Fig. 5), which contrasts with the results shown in Tab. 2. This may have to do with the fact that N-best lists with S = 2 have much fewer local maxima than with S = 4, 8, in which case 20 restarts is generally enough.”

-(Galley and Quirk, 2011)

55
Thursday, November 29, 12 - Objectives

• compare LP-MERT with Och’s MERT (1D-MERT)

• evaluate exact LP-MERT

- effect of number of features on runtime

- scalability with the number of sentences

• evaluate LP-MERT w/ beam search

- BLEU scores for dev set

- end-to-end MERT comparison

56

Thursday, November 29, 12 - Table 2

gains of over 0.5 BLEU point (in all cases where S ≤ 512)

            32      64      128     256     512     1024
1D-MERT     22.93   20.70   18.57   16.07   15.00   15.44
our work    25.25   22.28   19.86   17.05   15.56   15.67
            +2.32   +1.59   +1.29   +0.98   +0.56   +0.23

Table 2: BLEUn4r1[%] scores for English-German on WMT09 for tuning sets ranging from 32 to 1024 sentences. (Galley and Quirk, 2011)

57
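The "gains of over 0.5 BLEU point" remark can be read directly off the last row of Table 2. The short sketch below uses the gains exactly as printed in the table; the variable names are hypothetical.

```python
# Tuning-set sizes and the gains of LP-MERT over 1D-MERT as printed in Table 2.
sizes = [32, 64, 128, 256, 512, 1024]
gains = [2.32, 1.59, 1.29, 0.98, 0.56, 0.23]

# Keep the tuning-set sizes whose gain is at least 0.5 BLEU point.
sizes_with_half_point_gain = [s for s, g in zip(sizes, gains) if g >= 0.5]

# The gain shrinks as the tuning set grows.
gains_decrease = all(a > b for a, b in zip(gains, gains[1:]))
```

Only the 1024-sentence tuning set falls below the 0.5 point threshold, in line with the statement that the gains hold for S ≤ 512.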

Thursday, November 29, 12 - End-to-end comparison

Method     iterations of outer loop   WMT 2010 devset   WMT 2010 test set
1D-MERT    9                          15.97             16.91
LP-MERT    7                          16.21             17.08
                                      +0.24             +0.17

This table is created from the results of (Galley and Quirk, 2011)

58

Thursday, November 29, 12 - Summary

• Och’s MERT remains inexact in the multi-dimensional search

• proposed LP-MERT, an exact search algorithm on N-best lists in the multi-dimensional case.

- gives us the optimal parameters as ground truth

- In practice, two approximations are offered

• Experimental results: slightly improves over Och’s MERT in terms of sentence BLEU.

59

Thursday, November 29, 12