This page reproduces the content of http://www.slideshare.net/naist-mtstudy/mt-study-20151015-miura.

by NAIST Machine Translation Study Group

Uploaded about 1 year ago (2015/10/15) in Technology

Paper Introduction,

"Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages"

Tomer Levinboim and David Chiang

- MT Study Group

Supervised Phrase Table Triangulation

with Neural Word Embeddings

for Low-Resource Languages

Tomer Levinboim and David Chiang

Proc. of EMNLP 2015, Lisbon, Portugal

Introduced by Akiva Miura, AHC-Lab

15/10/15

2015 © Akiva Miura, AHC-Lab, IS, NAIST

1 - Contents

1. Introduction

2. Preliminaries

3. Supervised Word Translations

4. Experiments

5. Conclusion

6. Impression


2 - 1. Introduction

Problem: Scarceness of Bilingual Data

l PBMT systems require considerable amounts of source-target parallel data to produce good quality translation

Ø A triangulated source-target phrase table can be composed from a source-pivot and pivot-target phrase table, but still noisy

l This paper shows a supervised learning technique that improves noisy phrase translation scores by extraction of word translation distributions from small amounts of bilingual data

Ø This method gained improvement on Malagasy-to-French and Spanish-to-French translation tasks via English


3 - 2. Preliminaries

(Slides 3–4 show a screenshot of the paper page: Section 2 "Preliminaries" and Section 3 "Supervised Word Translations".)

5 - 2 Preliminaries

2 Preliminaries

Let s, p, t denote words and s̄, p̄, t̄ denote phrases in the source, pivot, and target languages, respectively. Also, let T denote a phrase table estimated over a parallel corpus and T̂ denote a triangulated phrase table. We use similar notation for their respective phrase translation features φ, lexical-weighting features lex, and the word translation probabilities w.

2.1 Triangulation (weak baseline)

In phrase table triangulation, a source-target phrase table T_st is constructed by combining a source-pivot and pivot-target phrase table T_sp, T_pt, each estimated on its respective parallel data. For each resulting phrase pair (s̄, t̄), we can also compute an alignment â as the most frequent alignment obtained by combining source-pivot and pivot-target alignments a_sp and a_pt across all pivot phrases p̄ as follows: {(s, t) | ∃p : (s, p) ∈ a_sp ∧ (p, t) ∈ a_pt}.

The triangulated source-to-target lexical weights, denoted lex̂_st, are approximated in two steps: First, word translation scores ŵ_st are approximated by marginalizing over the pivot words:

    ŵ_st(t | s) = Σ_p w_sp(p | s) · w_pt(t | p).    (1)

Next, given a (triangulated) phrase pair (s̄, t̄) with alignment â, let â_{s,:} = {t | (s, t) ∈ â}; the lexical-weighting probability is (Koehn et al., 2003):

    lex̂_st(t | s̄, â) = (1 / |â_{s,:}|) Σ_{t ∈ â_{s,:}} ŵ_st(t | s).    (2)

The triangulated phrase translation scores, denoted φ̂_st, are computed by analogy with Eq. 1. We also compute these scores in the reverse direction by swapping the source and target languages.

2.2 Interpolation (strong baseline)

Given access to source-target data, an ordinary source-target phrase table T_st can be estimated directly. Wu and Wang (2007) suggest interpolating phrase pairs entries that occur in both tables:

    T_interp = α T_st + (1 − α) T̂_st.    (3)

Phrase pairs appearing in only one phrase table are added as-is. We refer to the resulting table as the interpolated phrase table.

3 Supervised Word Translations

While interpolation (Eq. 3) may help correct some of the noisy triangulated scores, its effect is limited to phrase pairs appearing in both phrase tables. Here, we suggest a discriminative supervised learning method that can affect all phrase pairs.

Our idea is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and use them to learn discriminately: correct target words should become likely translations, and incorrect ones should be down-weighted. To generalize beyond the vocabulary of the source-target data, we appeal to word embeddings.

We present our formulation in the source-to-target direction. The target-to-source direction is obtained simply by swapping the source and target languages.

3.1 Model

Let c^sup_st denote the number of times source word s was aligned to target word t (in word alignment, or in the dictionary). We define the word translation distributions w^sup(t | s) = c^sup_st / c^sup_s, where c^sup_s = Σ_t c^sup_st. Furthermore, let q(t | s) denote the word translation probabilities we wish to learn and consider maximizing the log-likelihood function:

    arg max_q L(q) = arg max_q Σ_{(s,t)} c^sup_st log q(t | s).

Clearly, the solution q(· | s) := w^sup(· | s) maximizes L. However, we would like a solution that generalizes to source words s beyond those observed in the source-target corpus – in particular, those source words that appear in the triangulated phrase table T̂, but not in T.

In order to generalize, we abstract from words to vector representations of words. Specifically, we constrain q to the following parameterization:

    q(t | s) = (1/Z_s) exp(v_s^T A v_t + f_st^T h)
    Z_s = Σ_{t ∈ T(s)} exp(v_s^T A v_t + f_st^T h).

Here, the vectors v_s and v_t represent monolingual features and the vector f_st represents bilingual features. The parameters A and h are to be learned. In this work, we use monolingual word embeddings for v_s and v_t, and set the vector f_st to contain only the value of the triangulated score, such …

l Given access to source-target data, an ordinary source-target phrase table T_st can be estimated directly

l Interpolation of phrase pairs entries that occur in both tables:

    T_interp = α T_st + (1 − α) T̂_st    (3)

Phrase pairs appearing in only one phrase table are added as-is
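The triangulation of word translation scores (Eq. 1) and the table interpolation of Wu and Wang (2007) (Eq. 3) can be sketched in a few lines of Python. The toy probabilities and the dictionary-based table layout below are illustrative assumptions, not data or code from the paper:

```python
# Sketch of word-score triangulation (Eq. 1) and table interpolation (Eq. 3).
# Toy Spanish-English-French probabilities; purely illustrative.

# w_sp[s][p]: p(pivot word p | source word s); w_pt[p][t]: p(target t | pivot p)
w_sp = {"gato": {"cat": 0.9, "feline": 0.1}}
w_pt = {"cat": {"chat": 0.8, "minou": 0.2}, "feline": {"felin": 1.0}}

def triangulate(w_sp, w_pt):
    """Eq. 1: w_hat(t | s) = sum_p w_sp(p | s) * w_pt(t | p)."""
    w_hat = {}
    for s, pivots in w_sp.items():
        scores = {}
        for p, w1 in pivots.items():
            for t, w2 in w_pt.get(p, {}).items():
                scores[t] = scores.get(t, 0.0) + w1 * w2
        w_hat[s] = scores
    return w_hat

def interpolate(T_st, T_hat, alpha=0.5):
    """Eq. 3: alpha*T_st + (1-alpha)*T_hat for pairs in both tables;
    pairs appearing in only one table are added as-is."""
    out = {}
    for pair in set(T_st) | set(T_hat):
        if pair in T_st and pair in T_hat:
            out[pair] = alpha * T_st[pair] + (1 - alpha) * T_hat[pair]
        else:
            out[pair] = T_st.get(pair, T_hat.get(pair))
    return out

w_hat = triangulate(w_sp, w_pt)
# w_hat["gato"] marginalizes over both pivot words "cat" and "feline"
```

Note how `interpolate` only smooths pairs that occur in both tables, which is exactly the limitation the supervised method of Section 3 is designed to overcome.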


6 - 3. Supervised Word Translation

l The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables.

l The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation, and use them to learn discriminately

• correct target words should become likely translations

• incorrect ones should be down-weighted

Ø To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings
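The model of Section 3.1, q(t | s) ∝ exp(v_sᵀ A v_t + f_stᵀ h), is a bilinear softmax over word embeddings. The numpy sketch below shows its shape; the dimensions, random embeddings, and count matrix are illustrative assumptions, not the paper's setup (in the paper, A and h are learned by maximizing the supervised log-likelihood):

```python
import numpy as np

# Sketch of q(t | s) = (1/Z_s) exp(v_s^T A v_t + f_st^T h) from Sec. 3.1.
# All sizes and values are toy assumptions for illustration.
rng = np.random.default_rng(0)
d = 4                            # embedding dimension (illustrative)
V_s = rng.normal(size=(3, d))    # monolingual source embeddings v_s
V_t = rng.normal(size=(5, d))    # monolingual target embeddings v_t
F = rng.normal(size=(3, 5, 1))   # bilingual features f_st (here: one score per pair)
A = rng.normal(size=(d, d))      # interaction matrix, to be learned
h = rng.normal(size=(1,))        # bilingual feature weights, to be learned

def q(s):
    """Translation distribution over all candidate target words for source word s."""
    logits = V_s[s] @ A @ V_t.T + F[s] @ h   # v_s^T A v_t + f_st^T h for every t
    logits -= logits.max()                   # stabilize before exponentiating
    p = np.exp(logits)
    return p / p.sum()                       # dividing by Z_s normalizes

def neg_log_likelihood(c_sup):
    """-L(q) = -sum_{(s,t)} c_sup[s, t] * log q(t | s); minimized during training."""
    return -sum((c_sup[s] * np.log(q(s))).sum() for s in range(c_sup.shape[0]))
```

Training would minimize `neg_log_likelihood` with respect to `A` and `h` (e.g. by gradient descent); because q is defined through embeddings rather than per-word parameters, it can score source words never seen in the source-target corpus.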


7 - 2 Preliminaries

(Slides 7 and following repeat the screenshot of the paper page: Section 2 "Preliminaries" and Section 3 "Supervised Word Translations".)

notation

get vot,

for and

their target

should languages,

learning

become respec-

method

lik that

ely canofa the

↵ect noisy

all

translations, triangulated

phrase

and pairs. scores,

incorrect

its e↵ect is lim-

them to learn discriminately: correct target words

phrase table respecti

Tst is ve phrase

constructed by tiv

translation ely. Also,

features let

combining ,aTle denote

xical- a

ones phrase

Our table

should idea

be estimated

is

do to reg ited

ard w to

ord

wn-weighted. phrase

To

pairs

translation appearing

distri-

generalize be-

in both phrase ta-

In phrase table triangulation, ovaer a parallel

source-tarcorpus

get

and ˆT denote

should

a triangu-

become likely bles. Here, we

translations,suggest

and a discriminati

incorrect

ve supervised

phrase

source-piv table

ot and T

weighting

piv features

ot-target lex, and

phrase the

table wTord translation

yond butions

the v deri

ocab ved from

ulary of source-tar

the

get bilingual

source-target

data

sp, Tpt,

data, we

st is constructed lated

by phrase table.

combining W

a e use similar

ones notation

should for

be their

do

learning method

wn-weighted. To that can a↵

generalize ect all

be- phrase pairs.

probabilities w.

(through word alignments or dictionary entries)

source-pi

each

vot

estimated and

on pi

its vot-target

respective parallel data. For

respecti

phrase

ve

table phrase

T

appeal to word embeddings.

sp, T translation

pt,

features

yond the v , le

ocab xical-

ulary of Our

the idea is to

source-tar reg

get ard word

data,

translation

we

distri-

as the correct translation distributions, and use

each

2.1

resulting

Triangulation

phrase pair (s, t (weak

), we weighting

baseline)

can also features

com-

lex, and

We the word

present translation

our

butions

formulation deri

in ved

the from source-tar

source-to-

get bilingual data

each estimated on its respective parallel data. For

appeal

them to

to word

learn embeddings.

discriminately: correct target words

pute an alignment ˆa as the most

probabilities

frequent

w.

align-

target direction. The tar (through word

get-to-source alignments

direction is or dictionary entries)

each resulting

In

phrase

phrase

pair

table (s, t), we can

triangulation, aalso com-

source-target

We present

should

our

become lik formulation

ely

in

translations,

the

and

source-to-

incorrect

as the correct translation distributions, and use

pute

ment

an alignment

phrase

obtained by

ˆa

table as

T

combining source-pivot and

st the

is

2.1

most

constructed Triangulation

frequent

by

align- (weak

combining a

baseline)

target

ones

obtained direction.

should

simply be

by The

doswapping the source and tar-

them

tar

to

wn-weighted. learn

To

discriminately:

get-to-source direction

generalize be-

correct

is

target words

pivment

ot-tar obtained

get

by

source-piv

alignments combining

ot and

a pivot-tar

sp and apt

In

get phrase

source-pi

phrase

across all vpitable

ot

table

v T

ot sp triangulation,

and,Tpt,

get yond a

obtainedthesource-tar

simply

v

languages. ocab get

by

ulary sw

of should

the

become

apping the

source-tar lik

get ely

source

data, translations,

and

we tar- and incorrect

pivot-tar

phrases p get

as alignments

each estimated

follows: {(s, taon its

) | 9p : ( phrase

respective

s, p) table

parallel

2 asp Tdata. For

appeal to word embeddings.

sp and apt across all st

pi

^ visot constructed

get

by combining

languages.

a

ones should be down-weighted. To generalize be-

each resulting phrase pair (

(p, t)

.

source-pi

s, t), we vot

can and pi

also vot-tar

com-

3.1 get phrase

We

Model table T

present our formulation in the source-to-

sp, Tpt,

yond the vocabulary of the source-target data, we

phrases

2 apt}p as follows: {(s, t) | 9p : (s, p) 2 asp ^

pute an alignment ˆa as

each

the

estimated

most

on

frequent its respecti

align-

Let3.1 v

tar e parallel

csup Model

get

data.

direction. For

The tar appeal to word

get-to-source embeddings.

(p

direction is

,

Thet) 2 apt}.

triangulated

source-to-target

lexical

ment obtained by

each

combining resulting phrase

source-pivot

pair

and (s,

st denote the number of times source word

t), we can

obtained

also

simplycom-

by sw

We

apping present

the

our

source formulation

and tar-

in the source-to-

The

weights,

triangulated

pivot-tar

denoted

get

c

lexst, alignments

are

a

approximated in two

s was aligned to target word t (in word alignment,

sp pute

source-to-tar

and aan

get

pt alignment

le

across all ˆa

xical

piv as

ot the

Letmost

csup

get st frequent

denote

languages. align-

the

target

number direction.

of times The target-to-source

source word

direction is

weights,

steps: First, denoted

word

c

phrases plex

as st, are

follows:

translation {(s, t)ment

scores | 9ˆ

w p

st obtained

approximated

: (s

are,in

p)

ap- by

tw

2 o

a combining

or s

in was

sp ^

the source-pi

alignedvot

to

dictionary).and

target

We obtained

word

define t simply

(in

the w

w by

ord

ord swapping

transla- the

alignment, source and tar-

steps: First,

proximated (p

by

sup

, w

t) ord

2

mar a

3.1 Model

sup

pt translation

}.

ginalizing over thepiv

piot-tar

scoresv ˆw

ot get

w alignments

st are

ords: ap- asp

tion and

or in apt

distrib across

theutionsallwpivot

dictionary).

(t | get

Wes) languages.

define

= c the word transla-

st /csup

s , where

The

triangulated

phrases p

source-to-tar as

getfollow - that fst := ˆwst. Therefore, the matrix A is a lin-

ear transformation between the source and target

that fst := ˆwst. Therefore, the matrix A is a lin-

embedding spaces, and h (now a scalar) quantifies

ear transformation between the source and target

how the triangulated scores ˆw are to be trusted.

embedding spaces, and h (now a scalar) quantifies

In the normalization factor Zs, we let t range

how the triangulated scores ˆw are to be trusted.

only over possible translations of s suggested by

In the normalization factor Zs, we let t range

either wsup or the triangulated word probabilities.

only over possible translations of s suggested by

That is: either wsup or the triangulated word probabilities.

that fst := ˆwst. Therefore, the matrix A is a lin-

T ear

(s) transformation

That

=

is:

{t | wsup(t between

| s) > the

0 source

_ ˆw(t and

| s) tar

> get

0}.

embedding spaces, and h (now a scalar) quantifies

how the triangulated

This

T

restriction (s) = scores

mak {

es t |

e w ˆw

supare

ffi

(t

cient |tos be

) trusted.

> 0 _ ˆw

computation (t | s

pos- ) > 0}.

In the normalization factor Zs, we let t range

sible, as otherwise the normalization term would

only over possible translations of s suggested by

This restriction makes efficient computation pos-

have

Figure 1: The (target-to-source) objective function

either

to be wsup or the triangulated

computed over the word probabilities.

entire target vocab-

sible, as otherwise the normalization term would

ulary.

per iteration. Applying batch Adagrad (blue) sig-

That is:

have to be computed over the entire target vocab-

Figure 1: The (target-to-source) objective function

Under this parameterization, our goal is to solve

nificantly accelerates convergence.

T (s)

per iteration. Applying batch Adagrad (blue) sig-

= {

ulary t. | wsup(t | s) > 0 _ ˆw(t | s) > 0}.

the following maximization problem:

Under this parameterization, our goal is to solve

nificantly accelerates convergence.

This restriction makes efficient

X

computation pos-

However, we found the supervised word trans-

sible,

max as

L( otherwise

the follo

A, h) = the

wing normalization

max

csup

st logterm

maximization

q(t w

| ould

problem:

s).

(4)

lation scores q to be too sharp, sometimes assign-

ha

Av,eh to be computedA,ohvers,the

t

entire target

X vocab-

Figure 1: The (target-to-source) objective function

However, we found the supervised word trans-

ing all probability mass to a single target word. We

ulary.

per iteration. Applying batch Adagrad (blue) sig-

max L(A, h) = max

csup

st log q(t | s).

(4)

lation scores q to be too sharp, sometimes assign-

3.2

Under this parameterization,

A,h

Optimization

our

A, goal

h

therefore interpolated q with the triangulated word

s is

,t to solve

nificantly accelerates convergence.

the following maximization problem:

ing all probability mass to a single target word. We

translation scores ˆw:

The

3.2

objective function in Eq. 4 is concave in both

X

Optimization

However, we found the supervised

therefore

word trans-

interpolated q with the triangulated word

A and h. max

This L(A

is , h) = max

because

csup

after taking the log, we

translation scores ˆw:

A

st log q(t | s).

(4)

lation scores q to be too sharp,

q =sometimes

q + (1 assign-

,h

A,h

The objective

) ˆw.

(6)

s,t

function in Eq. 4 is

are left with a weighted sum of linear and conca ing

conca

ve all

ve probability

in both mass to a single target word. We

(neg 3.2

ative Optimization

A and h.

log-sum-e 3

xp) .

This 2is O

terms p8

in m

because

A iza8

after

and h. o

W n

e

taking therefore

the

interpolated

log, we

q with the triangulated word

can

q = q + (1

) ˆw.

(6)

are left with a weighted sum of linear

To integrate the lexical weights induced by q

translation

and

scores

concave ˆw:

The

therefore objecti

reachve function

the

in

global Eq. 4 is conca

solution of ve in

the both

problem

(Eq. 2), we simply appended them as new features

using A and h.

(neThis

gati

gradient

is

vebecause

descent.

after

log-sum-e taking

xp) the log,

terms in we

A and h. We can

To integrate the lexical weights induced by q

l

in the q

phrase table in addition to the existing lexi-

The objec8v

= q + (1

) ˆw.

(6)

are e

left func8on in Eq. 4

with a weighted

therefore

sum

reach the is

of

co

linear n

global ca

and ve in

concav

solution b

e o

of th

theA and h

problem

Taking derivatives, the gradient is

(Eq. 2), we simply appended them as new features

cal weights. Following this, we can search for a

Ø We can

(nere

g ac

ativ h

e th

using e glob

log-sum-e al

xp)

gradient solu

terms8o

in nA

descent. of th

and h e

. p

W reob

canlem Tuosing

inte grad

grate ien

the t

lexical weights induced by q

in the phrase table in addition to the existing lexi-

value that maximizes B

on a tuning set.

descent

therefore reach the global solution of the problem

@L

X

@L

X

(Eq. 2), we simply appended them as new features

Taking derivatives, the gradient is

using gradient

m descent.

cal weights. Following this, we can search for a

stvsvT

t

mst fst

in the phrase table in addition to the existing lexi-

l Taking de

@ rA =

iv

T a8v

aking

@h =

s,tes,

deri vth

ati e

v g

es, rad

the

X ient i

gradients is s,t

cal

X weights.

3.4 Following

v this,

alue

Summary we

of can

that search

method for a

maximizes B

on a tuning set.

@L

@L

mstvsvTt

value

mst that

fst maximizes B

on a tuning set.

@L

X

@A =

@L

X

In summary, to improve upon a triangulated or in-

where the scalar m

@h =

m

csup

3.4 Summary of method

stvsvT

s,t

mst fst

s,t

@A =

st = t st

csup

@h =s q(t | s) for the

s

terpolated phrase table, we:

,t

s,t

3.4 Summary of method

current value of q.

In summary, to improve upon a triangulated or in-

where the scalar m

In summary, to improve upon a triangulated or in-

For where the

quick scalar

results,m

st = csup

we limited the st

csup

s q(t | s) for the

st = csup

number of gra-

st

csup

s q(t | s) for the

terpolated

1. phrase

Learntable,

w we:

terpolated

ord

phrase

translation

table,

distrib we:

utions q by super-

dient current

steps value

to

of

current

200 q

v .alue of q.

l For quick results, this re

and search li

selected mite

the d the n

iterationumb

that er of gradie

visionnt

against distributions wsup derived from

For quick

F results,

or

we

quick limited

results, the

wenumber of

limited gra-

the

minimized the total variation distance to wsup ov 1.

er Learn

number w

of ord translation

gra-

1. distrib

Learn utions

wordq by super-

translation distributions q by super-

steps to dient

200steps

andto s 200

elecand

te selected

d the i the

ter iteration

a8on ththat

dient steps to 200 and selected the

at minimized the

the total

source-target bilingual data (§3.1).

vision

iteration against

that distributions wsup derived from

a held out dev set:

vision against distributions wsup derived from

varia8o minimized

n distan the

ce total

to v

w ariation

sup ovedistance

r a heltod wsup

ou o

t ver

minimized the total variation distance

dev s to

e the

t:w source-tar

sup over get bilingual data (§3.1).

a held out de

X v set:

2. Smooth

the

the

source-tar

learned

get

distrib bilingual

utions q data

by

(§3.1).

interpo-

a held out dev set:

X

||q(· | s) wsup(· | s)||1.

2.

(5) Smooth the learned

lating

distrib

with

utions q by

triangulated interpo-

word translation scores

s

||q(· | s) wsup(

X

· | s)||1.

(5)

lating with triangulated

2.

word translation

Smooth the

scores

learned distributions q by interpo-

ˆw (§3.3).

s

||q(· | s) wsup(· | s)||1. ˆw (§3.3).(5)

lating with triangulated word translation scores

15/10/15

We obtained

2

better015©Askiv

con a vMiu

er ra AHC- ‐Lab

gence , IS, NAIS

rate T by us-

10

We obtained better convergence rate by us-

3. Compute ˆw

ne (

w§3.3).

ing a

lexical weights and append them

ing a batch

batch v version

ersion of

of the

the e

3. Compute new lexical weights and append them

↵

e ecti

↵ ve

ectivand

e

easy-

and easy-

to the phrase table (§3.3).

to-implement

We

to-implement

Adagrad

obtained

Adagrad technique

better

technique(Duchi< - 3.2 Op8miza8on (cont’d)

that fst := ˆwst. Therefore, the matrix A is a linear transformation between the source and target embedding spaces, and h (now a scalar) quantifies how the triangulated scores ˆw are to be trusted.

In the normalization factor Zs, we let t range only over possible translations of s suggested by either wsup or the triangulated word probabilities. That is:

  T(s) = {t | wsup(t | s) > 0 ∨ ˆw(t | s) > 0}.

This restriction makes efficient computation possible, as otherwise the normalization term would have to be computed over the entire target vocabulary.

Under this parameterization, our goal is to solve the following maximization problem:

  max_{A,h} L(A, h) = max_{A,h} Σs,t csup_st log q(t | s).  (4)

3.2 Optimization

The objective function in Eq. 4 is concave in both A and h. This is because after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in A and h. We can therefore reach the global solution of the problem using gradient descent. Taking derivatives, the gradient is

  ∂L/∂A = Σs,t mst·vs·vtT    ∂L/∂h = Σs,t mst·fst

where the scalar mst = csup_st − csup_s · q(t | s) for the current value of q.

For quick results, we limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to wsup over a held out dev set:

  Σs ||q(· | s) − wsup(· | s)||1.  (5)

We obtained a better convergence rate by using a batch version of the effective and easy-to-implement Adagrad technique (Duchi et al., 2011). See Figure 1.

[Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.]

However, we found the supervised word translation scores q to be too sharp, sometimes assigning all probability mass to a single target word. We therefore interpolated q with the triangulated word translation scores ˆw:

  qβ = β·q + (1 − β)·ˆw.  (6)

To integrate the lexical weights induced by qβ (Eq. 2), we simply appended them as new features in the phrase table in addition to the existing lexical weights. Following this, we can search for a β value that maximizes Bleu on a tuning set.

3.3 Re-estimating lexical weights

Having learned the model (A and h), we can now use q(t | s) to estimate the lexical weights (Eq. 2) of any aligned phrase pair (s̄, t̄, ˆa), assuming it is composed of embeddable words.

3.4 Summary of method

In summary, to improve upon a triangulated or interpolated phrase table, we:

1. Learn word translation distributions q by supervision against distributions wsup derived from the source-target bilingual data (§3.1).
2. Smooth the learned distributions q by interpolating with triangulated word translation scores ˆw (§3.3).
3. Compute new lexical weights and append them to the phrase table (§3.3).

4 Experiments

To test our method, we conducted two low-resource translation experiments using the phrase-based MT system Moses (Koehn et al., 2007).

15/10/15
2015©Akiva Miura AHC-Lab, IS, NAIST
11
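As a concrete sketch of the model and gradient above, the following toy example assumes the score form q(t | s) ∝ exp(vsᵀ·A·vt + h·ˆwst) described in the surrounding text, with mst = csup_st − csup_s·q(t | s); the 2-d embeddings, counts, and triangulated scores are hypothetical.

```python
import math

# Toy 2-d embeddings for source and target words (hypothetical values).
src_vec = {"gato": [1.0, 0.0], "perro": [0.0, 1.0]}
tgt_vec = {"chat": [1.0, 0.1], "chien": [0.1, 1.0]}
w_hat = {("gato", "chat"): 0.6, ("gato", "chien"): 0.4,
         ("perro", "chat"): 0.3, ("perro", "chien"): 0.7}  # triangulated scores
c_sup = {("gato", "chat"): 5, ("perro", "chien"): 4}       # supervised counts

def score(s, t, A, h):
    # f_st := w_hat_st; score = vs^T A vt + h * f_st
    vs, vt = src_vec[s], tgt_vec[t]
    bilinear = sum(vs[i] * A[i][j] * vt[j] for i in range(2) for j in range(2))
    return bilinear + h * w_hat[(s, t)]

def q(t, s, A, h):
    # Normalize only over T(s); here every target has a triangulated score.
    Z = sum(math.exp(score(s, u, A, h)) for u in tgt_vec)
    return math.exp(score(s, t, A, h)) / Z

def gradients(A, h):
    # dL/dA = sum_st m_st * vs vt^T ; dL/dh = sum_st m_st * f_st
    dA = [[0.0, 0.0], [0.0, 0.0]]
    dh = 0.0
    c_s = {s: sum(c for (s2, _), c in c_sup.items() if s2 == s) for s in src_vec}
    for s in src_vec:
        for t in tgt_vec:
            m = c_sup.get((s, t), 0) - c_s[s] * q(t, s, A, h)
            vs, vt = src_vec[s], tgt_vec[t]
            for i in range(2):
                for j in range(2):
                    dA[i][j] += m * vs[i] * vt[j]
            dh += m * w_hat[(s, t)]
    return dA, dh

# A few plain gradient-ascent steps on the concave objective.
A, h, lr = [[0.0, 0.0], [0.0, 0.0]], 0.0, 0.1
for _ in range(50):
    dA, dh = gradients(A, h)
    for i in range(2):
        for j in range(2):
            A[i][j] += lr * dA[i][j]
    h += lr * dh

print(q("chat", "gato", A, h))  # should now exceed q("chien", "gato", ...)
```

After training, the supervised counts dominate: "gato" concentrates its mass on "chat" and "perro" on "chien".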

3.3 Re-estimating lexical weights

l Having learned the model (A and h), we can now use q(t | s) to estimate the lexical weights (Eq. 2) of any aligned phrase pair (s̄, t̄, ˆa), assuming it is composed of embeddable words
l However, the authors found the supervised word translation scores q to be too sharp, sometimes assigning all probability mass to a single target word
Ø They therefore interpolated q with the triangulated word translation scores:
  qβ = β·q + (1 − β)·ˆw.  (6)
• To integrate the lexical weights induced by qβ (Eq. 2), they simply appended them as new features in the phrase table

15/10/15
2015©Akiva Miura AHC-Lab, IS, NAIST
12
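Eq. 2 itself does not survive in this transcript; the sketch below therefore assumes a standard Moses-style lexical weighting, which multiplies, for each target word, the word model q averaged over its alignment links. The toy probabilities and phrases are hypothetical.

```python
# Hypothetical word-translation probabilities q_beta(t | s) after smoothing.
q_beta = {("gato", "chat"): 0.9, ("negro", "noir"): 0.8, ("gato", "noir"): 0.05}

def lex_weight(src_phrase, tgt_phrase, align, q):
    """Moses-style lexical weight of a phrase pair under word model q.
    align is a set of (i, j) links from source position i to target position j."""
    total = 1.0
    for j, t in enumerate(tgt_phrase):
        links = [i for (i, j2) in align if j2 == j]
        if links:  # average q over the source words aligned to t
            total *= sum(q.get((src_phrase[i], t), 1e-9) for i in links) / len(links)
        else:      # unaligned target words would pair with NULL; crudely penalized here
            total *= 1e-9
    return total

w = lex_weight(["gato", "negro"], ["chat", "noir"], {(0, 0), (1, 1)}, q_beta)
print(w)  # 0.9 * 0.8
```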

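The iterate-selection rule of Eq. 5, which keeps the gradient step whose distributions are closest to wsup in total variation on a held-out dev set, can be sketched as follows; the snapshot values are hypothetical.

```python
def total_variation(q, w_sup, sources, targets):
    # Eq. 5: sum_s || q(. | s) - w_sup(. | s) ||_1
    return sum(abs(q.get((s, t), 0.0) - w_sup.get((s, t), 0.0))
               for s in sources for t in targets)

sources, targets = ["gato"], ["chat", "chien"]
w_sup = {("gato", "chat"): 1.0}
# Hypothetical snapshots of q taken at three gradient iterations:
snapshots = [{("gato", "chat"): 0.5, ("gato", "chien"): 0.5},
             {("gato", "chat"): 0.8, ("gato", "chien"): 0.2},
             {("gato", "chat"): 0.95, ("gato", "chien"): 0.05}]
best = min(range(len(snapshots)),
           key=lambda i: total_variation(snapshots[i], w_sup, sources, targets))
print(best)  # the last snapshot is closest to w_sup
```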

3.4 Summary of method

In summary, to improve upon a triangulated or interpolated phrase table, the authors:

1. Learn word translation distributions q by supervision against distributions wsup derived from the source-target bilingual data (§3.1)
2. Smooth the learned distributions q by interpolating with triangulated word translation scores ˆw (§3.3)
3. Compute new lexical weights and append them to the phrase table (§3.3)

15/10/15
2015©Akiva Miura AHC-Lab, IS, NAIST
13
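Step 2 smooths q with the triangulated word scores ˆw. The transcript omits the marginalization formula itself, so this sketch assumes the usual pivot decomposition ˆw(t | s) = Σp w(t | p)·w(p | s); the toy source-pivot and pivot-target tables are hypothetical, and β would be tuned for Bleu.

```python
# Hypothetical word-level tables through the English pivot.
w_sp = {("gato", "cat"): 1.0}                         # source -> pivot
w_pt = {("cat", "chat"): 0.7, ("cat", "chien"): 0.3}  # pivot -> target

def w_hat(t, s, pivots=("cat",)):
    # Assumed pivot marginalization: w_hat(t | s) = sum_p w_pt(t | p) * w_sp(p | s)
    return sum(w_pt.get((p, t), 0.0) * w_sp.get((s, p), 0.0) for p in pivots)

def smooth(q_ts, t, s, beta):
    # Eq. 6: q_beta = beta * q + (1 - beta) * w_hat
    return beta * q_ts + (1.0 - beta) * w_hat(t, s)

print(w_hat("chat", "gato"))             # 0.7
print(smooth(1.0, "chat", "gato", 0.5))  # 0.5 * 1.0 + 0.5 * 0.7 = 0.85
```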

4. Experiments

l To test the proposed method, the authors conducted two low-resource translation experiments using Moses

Translation Tasks:

l Fixing the pivot language to English, they applied their method on two data scenarios:

1. Spanish-to-French:
two related languages used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3)

2. Malagasy-to-French:
two unrelated languages for which they have a small dictionary, but no parallel corpus. The baseline is triangulation alone.

15/10/15
2015©Akiva Miura AHC-Lab, IS, NAIST

14 - 4.1 Data

4.1 Data

Fixing the pivot language to English, we applied our method on two data scenarios:

1. Spanish-to-French: two related languages used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3).

2. Malagasy-to-French: two unrelated languages for which we have a small dictionary, but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone (there is no source-target model to interpolate with).

Table 1 lists some statistics of the bilingual data we used. European-language bitexts were extracted from Europarl (Koehn, 2005). For Malagasy-English, we used the Global Voices parallel data available online.1 The Malagasy-French dictionary was extracted from online resources2 and the small Malagasy-French tune/test sets were extracted3 from Global Voices.

Table 1: Bilingual datasets (lines of data). Legend: sp=Spanish, fr=French, en=English, mg=Malagasy.

language pair  train  tune  test
sp-fr          4k     1.5k  1.5k
mg-fr          1.1k   1.2k  1.2k
sp-en          50k    -     -
mg-en          100k   -     -
en-fr          50k    -     -

Table 2 lists token statistics of the monolingual data used. We used word2vec4 to generate French, Spanish and Malagasy word embeddings. The French and Spanish embeddings were (independently) estimated over their combined tokenized and lowercased Gigaword5 and Leipzig news corpora.6 The Malagasy embeddings were similarly estimated over data from Global Voices,7 the Malagasy Wikipedia and the Malagasy Common Crawl.8 In addition, we estimated a 5-gram French language model over the French monolingual data.

Table 2: Size of monolingual corpus per language as measured in number of tokens.

language  words
French    1.5G
Spanish   1.4G
Malagasy  58M

4.2 Spanish-French Results

To produce wsup, we aligned the small Spanish-French parallel corpus in both directions, and symmetrized using the intersection heuristic. This was done to obtain high precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision).

We used the skip-gram model to estimate the Spanish and French word embeddings and set the dimension to d = 200 and context window to w = 5 (default). Subsequently, to run our method, we filtered out source and target words that either did not appear in the triangulation, or did not have an embedding. We took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and between 5-9 times for the held out dev set (~530 words). This was done in both source-target and target-source directions.

In Table 3 we show that the distributions learned by our method are much better approximations of wsup compared to those obtained by triangulation. Using word embeddings, our method is able to better generalize on the dev set.

Table 3: Average total variation distance (Eq. 5) to the dev set portion of wsup (computed only over words whose translations in wsup appear in the triangulation).

Method         source→target  target→source
triangulation  71.6%          72.0%
our scores     30.2%          33.8%

We then examined the effect of appending our supervised lexical weights. We fixed the word-level interpolation β := 0.95 (effectively assigning very little mass to triangulated word translations ŵ) and searched for α ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 3 to maximize BLEU on the tuning set.

Our MT results are reported in Table 4. While interpolation improves over triangulation alone by +0.8 BLEU, our method adds another +0.7 BLEU on top of interpolation, a statistically significant gain (p < 0.01) according to a bootstrap resampling significance test (Koehn, 2004).

1 http://www.ark.cs.cmu.edu/global-voices
2 http://motmalgache.org/bins/homePage
3 https://github.com/vchahun/gv-crawl
4 https://radimrehurek.com/gensim/models/word2vec.html
5 http://catalog.ldc.upenn.edu
6 http://corpora.uni-leipzig.de/download.html
7 http://www.isi.edu/~qdou/downloads.html
8 https://commoncrawl.org/the-data/
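Table 3 compares distributions by average total variation distance (Eq. 5 in the paper). A minimal stdlib-only sketch of that comparison; the toy distributions below are illustrative, not the paper's data:

```python
# Average total variation distance between two sets of word-translation
# distributions, as used to produce Table 3. The distances range from
# 0 (identical distributions) to 1 (disjoint supports).

def total_variation(p, q):
    """TV distance between two distributions over the same vocabulary:
    0.5 * sum_t |p(t) - q(t)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)

def avg_total_variation(learned, reference):
    """Average TV distance over the source words both tables share."""
    shared = set(learned) & set(reference)
    return sum(total_variation(learned[s], reference[s]) for s in shared) / len(shared)

# Hypothetical triangulated vs. supervised (wsup) distributions for one word:
w_tri = {"gato": {"chat": 0.4, "chapeau": 0.6}}   # noisy triangulation
w_sup = {"gato": {"chat": 0.9, "chaton": 0.1}}    # alignment-derived
avg_tv = avg_total_variation(w_tri, w_sup)        # 0.5*(0.5+0.6+0.1) ~ 0.6
```

A lower value means the learned scores track the alignment-derived wsup more closely, which is exactly the gap between the two rows of Table 3.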

15 - 4.2 Spanish-French Results

l To produce wsup, the authors aligned the small Spanish-French parallel corpus in both directions, and symmetrized using the intersection heuristic to obtain high precision alignments (not grow-diag-final-and)

l To train the skip-gram model: dimension d = 200 and context window w = 5

l They took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and 5-9 times for the held out dev set (~530 words)

l They fixed β := 0.95 to examine the effect of their supervised method

Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU.

Method                    α    tune  test
source-target             -    26.8  25.3
triangulation             -    29.2  28.4
interpolation             0.7  30.2  29.2
interpolation+our scores  0.6  30.8  29.9
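The two baselines in Table 4 can be sketched as follows: triangulation composes the source-pivot and pivot-target phrase tables by marginalizing over pivot phrases, and interpolation mixes a direct table with the triangulated one using the weight α of Eq. 3. All phrase scores below are made-up toy values:

```python
# Phrase table triangulation through a pivot, and linear interpolation
# with a direct source-target table (Eq. 3). Toy scores, for illustration.

def triangulate(src_piv, piv_tgt):
    """p(t|s) ~= sum_p p(t|p) * p(p|s): compose two phrase tables."""
    src_tgt = {}
    for s, pivots in src_piv.items():
        scores = {}
        for p, p_given_s in pivots.items():
            for t, t_given_p in piv_tgt.get(p, {}).items():
                scores[t] = scores.get(t, 0.0) + t_given_p * p_given_s
        src_tgt[s] = scores
    return src_tgt

def interpolate(direct, triangulated, alpha):
    """Eq. 3: alpha * direct + (1 - alpha) * triangulated, per phrase pair."""
    mixed = {}
    for s in set(direct) | set(triangulated):
        d, tr = direct.get(s, {}), triangulated.get(s, {})
        mixed[s] = {t: alpha * d.get(t, 0.0) + (1 - alpha) * tr.get(t, 0.0)
                    for t in set(d) | set(tr)}
    return mixed

es_en = {"gato": {"cat": 0.8, "hat": 0.2}}          # source-pivot table
en_fr = {"cat": {"chat": 0.9, "chatte": 0.1},       # pivot-target table
         "hat": {"chapeau": 1.0}}
tri = triangulate(es_en, en_fr)
# tri["gato"] -> {'chat': ~0.72, 'chatte': ~0.08, 'chapeau': ~0.2};
# note the noise: 'chapeau' gets mass through the ambiguous pivot 'hat'.
direct = {"gato": {"chat": 1.0}}                    # tiny direct table
mixed = interpolate(direct, tri, alpha=0.7)
```

The spurious mass on "chapeau" is the triangulation noise the paper's supervised word-translation scores are meant to counteract.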

16 - 4.3 Malagasy-French Results

l The wsup distributions used for supervision were taken to be uniform distributions over the dictionary translations

• For each training direction, they used a 70%/30% split of the dictionary to form the train and dev sets

l To train the skip-gram model: d = 100, w = 3 (having significantly less Malagasy monolingual data)

From the paper: As before, we added our supervised lexical weights as new features in the phrase table. However, instead of fixing β = 0.95 as above, we searched for β ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 6 to maximize BLEU on a small tune set. We report our results in Table 5. Using only a dictionary, we are able to improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01).

Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU.

Method                    β    tune  test
triangulation             -    12.2  11.1
triangulation+our scores  0.6  12.4  11.6

5 Conclusion

In this paper, we argued that constructing a triangulated phrase table independently from even very limited source-target data (a small dictionary or parallel corpus) underutilizes that parallel data. Following this argument, we designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data as well as a distributed representation of words (embeddings). The latter enables our algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data.

We then used our model to generate new lexical weights for phrase pairs appearing in a triangulated or interpolated phrase table and demonstrated improvements in MT quality on two tasks. This is despite the fact that the distributions (wsup) we fit our model to were estimated automatically, or even naively as uniform distributions.

Acknowledgements

The authors would like to thank Daniel Marcu and Kevin Knight for initial discussions and a supportive research environment at ISI, as well as the anonymous reviewers for their helpful comments. This research was supported in part by a Google Faculty Research Award to Chiang.

References

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proc. ACL, pages 728-735.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12:2121-2159, July.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. NAACL HLT, pages 48-54.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. ACL, Interactive Poster and Demonstration Sessions, pages 177-180.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. EMNLP, pages 388-395.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. MT Summit, pages 79-86.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. ICLR, Workshop Track.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Proc. HLT-NAACL, pages 484-491.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proc. ACL, pages 856-863.
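The significance claims behind Tables 4 and 5 come from paired bootstrap resampling (Koehn, 2004): repeatedly resample the test set with replacement and count how often one system outscores the other. A stdlib-only sketch; the sentence-level score here is a toy stand-in for the corpus-level BLEU used in the paper, and the sentences are invented:

```python
# Paired bootstrap resampling significance test (Koehn, 2004).
import random

def score(hyp, ref):
    """Toy sentence score (unigram overlap); a stand-in for BLEU."""
    h, r = hyp.split(), set(ref.split())
    return sum(w in r for w in h) / max(len(h), 1)

def bootstrap_p(sys_a, sys_b, refs, samples=1000, seed=0):
    """Fraction of resampled test sets on which A does NOT beat B
    (a small value suggests A is significantly better)."""
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        a = sum(score(sys_a[i], refs[i]) for i in idx)
        b = sum(score(sys_b[i], refs[i]) for i in idx)
        wins += a > b
    return 1.0 - wins / samples

refs  = ["le chat dort", "un chapeau rouge", "le chien court"]
sys_a = ["le chat dort", "un chapeau rouge", "le chien saute"]
sys_b = ["la chatte dort", "une casquette rouge", "le chien court"]
p = bootstrap_p(sys_a, sys_b, refs)
```

In practice one runs this with the decoder outputs of, e.g., interpolation vs. interpolation+supervised scores, and reports significance at p < 0.01 as the paper does.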

17 - 5. Conclusion

In this paper:

l The authors argued that constructing a triangulated phrase table independently from even very limited source-target data underutilizes that parallel data

Ø They designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data as well as a distributed representation of words (embeddings)

Ø The latter enables their algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data

l The model with the new lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that wsup was estimated automatically or even naïvely as uniform distributions
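The "new lexical weights" mentioned above follow the standard phrase-pair lexical weighting of Koehn et al. (2003), into which the supervised word-translation scores are plugged. A sketch of that standard formula (the probabilities, alignment, and phrases below are toy values, not from the paper):

```python
# Lexical weight of a phrase pair (Koehn et al., 2003), computed from a
# word-translation table w(target | source) and a word alignment.

def lexical_weight(src, tgt, align, w):
    """lex(t|s, a) = prod_j (1/|{i:(i,j) in a}|) * sum_{i:(i,j) in a} w[t_j][s_i].
    Target words with no aligned source word translate from a NULL token."""
    total = 1.0
    for j, t in enumerate(tgt):
        srcs = [i for (i, jj) in align if jj == j]
        if srcs:
            total *= sum(w.get(t, {}).get(src[i], 0.0) for i in srcs) / len(srcs)
        else:
            total *= w.get(t, {}).get("NULL", 0.0)
    return total

w = {"chat": {"cat": 0.9}, "noir": {"black": 0.8}}   # w(target | source)
lw = lexical_weight(["black", "cat"], ["chat", "noir"],
                    [(1, 0), (0, 1)], w)             # 0.9 * 0.8 ~ 0.72
```

Replacing the noisy triangulated w with the supervised distributions changes these phrase-level features, which is how the word-level improvements reach the MT results.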


18 - 6. Impression


19 - End Slide


20