This page reproduces the content of http://www.slideshare.net/KoichiAkabe/presentation-35532593.

- Language Model

Natural Language Processing Series 4: Machine Translation, pp. 62-80

Koichi Akabe

MT study

NAIST

2014-05-08

2014-05-08 · Koichi Akabe (NAIST MT) · 1 / 20

Fluency of Machine Translation

Machine Translation: f −→ e

Which translation e is correct?

▶ e1 = he is big
▶ e2 = is big he −→ syntax is broken
▶ e3 = this is a purple dog −→ we have never seen this

We can know the answer without f

Language model (LM)

A language model gives a score P(e) to each sentence, without f:

▶ P(e = he is big)
▶ P(e = is big he)
▶ P(e = this is a purple dog)

Using this, we can compare sentences!

P(e = e1) > P(e = e3) > P(e = e2) ?

MT uses an LM to increase translation accuracy

We call P(e) the “language model probability”

How to calculate P(e)?

We want to calculate the probability of a sentence:

P(e = he is big)

Direct method: count the frequency of each sentence in the training data:

P_ML(e) = c_train(e) / Σ_{e′} c_train(e′)

However, almost all possible sentences are not contained in the training data
(−→ P_ML(e) = 0 for almost all sentences)

Focus on words to solve this problem
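The direct method can be sketched in a few lines of Python (a sketch of the idea; the function name is mine, not from the slides). Counting whole sentences makes the zero-probability problem obvious:

```python
from collections import Counter

def p_ml_sentence(corpus, sentence):
    """Direct method: P_ML(e) = c_train(e) / sum over e' of c_train(e'),
    i.e. the fraction of training sentences that are exactly e."""
    counts = Counter(tuple(s) for s in corpus)
    total = sum(counts.values())
    return counts[tuple(sentence)] / total if total else 0.0
```

Any sentence not seen verbatim in the corpus gets probability 0, which is why the next slides move to word-level models.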

Rewrite P using words

P(e = he is big)

First, we split the variable e into words and the sentence length I:

P(I = 3, e_1 = he, e_2 = is, e_3 = big)

To use a uniform variable type, we replace I with e_{I+1} = ⟨/s⟩:

P(e_1 = he, e_2 = is, e_3 = big, e_4 = ⟨/s⟩)

We also add a prefix symbol for contexts (described later):

P(e_0 = ⟨s⟩, e_1 = he, e_2 = is, e_3 = big, e_4 = ⟨/s⟩)

Rewrite P using the conditional probability P(word | context)

Chain rule:

P(e_0 = ⟨s⟩, e_1 = he, e_2 = is, e_3 = big, e_4 = ⟨/s⟩)
  = P(e_4 = ⟨/s⟩ | e_0 = ⟨s⟩, e_1 = he, e_2 = is, e_3 = big)
  × P(e_3 = big | e_0 = ⟨s⟩, e_1 = he, e_2 = is)
  × P(e_2 = is | e_0 = ⟨s⟩, e_1 = he)
  × P(e_1 = he | e_0 = ⟨s⟩) × P(e_0 = ⟨s⟩)

Generalize:

P(e_0^{I+1}) = ∏_{i=1}^{I+1} P_ML(e_i | e_0^{i−1}) = ∏_{i=1}^{I+1} c_train(e_0^i) / c_train(e_0^{i−1})

where e_i^j = e_i e_{i+1} · · · e_j is a part of the word sequence e_0 e_1 · · · e_{I+1}

However, c_train(e_0^i) becomes 0 for large i

n-gram language model

So, we do not use long word sequences!

The n-gram model uses only the last n − 1 words as context:

P(e_1^{I+1}) ≈ ∏_{i=1}^{I+1} P_ML(e_i | e_{i−n+1}^{i−1}) = ∏_{i=1}^{I+1} c_train(e_{i−n+1}^i) / c_train(e_{i−n+1}^{i−1})

The n-gram model eases the zero-probability problem
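A maximum-likelihood bigram model (n = 2) can be sketched as follows; this is my illustration, with ⟨s⟩ and ⟨/s⟩ written as `<s>` and `</s>`:

```python
from collections import Counter

def train_bigram_ml(sentences):
    """Maximum-likelihood bigram model: P_ML(w | v) = c(v w) / c(v),
    with <s> and </s> padding as on the slides."""
    bigrams, ctx = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for v, w in zip(padded, padded[1:]):
            bigrams[(v, w)] += 1
            ctx[v] += 1   # count of v as a context

    def p_ml(w, v):
        return bigrams[(v, w)] / ctx[v] if ctx[v] else 0.0

    return p_ml

def sentence_prob(p_ml, words):
    """P(e) approximated as the product of bigram probabilities."""
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for v, w in zip(padded, padded[1:]):
        prob *= p_ml(w, v)
    return prob
```

On a corpus containing only "he is big", the model gives "he is big" probability 1 and the broken "is big he" probability 0, matching the intuition from slide 2.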

Example of strict / 2-gram probabilities

Strict probability:

P(e = he is big)
  = P_ML(⟨/s⟩ | ⟨s⟩ he is big)
  × P_ML(big | ⟨s⟩ he is)
  × P_ML(is | ⟨s⟩ he)
  × P_ML(he | ⟨s⟩)

2-gram probability:

P(e = he is big)
  ≈ P_ML(⟨/s⟩ | big)
  × P_ML(big | is)
  × P_ML(is | he)
  × P_ML(he | ⟨s⟩)

Smoothing

Smoothing makes the LM robust to unknown linguistic phenomena

Basically, we calculate the n-gram LM probability with (n − 1)-gram or shorter contexts

Linear interpolation

Interpolate the probability with shorter n-grams:

P(e_i | e_{i−n+1}^{i−1}) = (1 − α) P_ML(e_i | e_{i−n+1}^{i−1}) + α P(e_i | e_{i−n+2}^{i−1})

(bar charts in the original compare the resulting distributions for large α and small α)

Give a constant probability to unknown words:

P(e_i) = (1 − α) P_ML(e_i) + α · 1 / |V|

where |V| is the vocabulary size

How to choose α?
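Linear interpolation for a bigram model might look like this; a minimal sketch, where using a single shared α for both levels is my simplification (in practice each level can have its own α):

```python
from collections import Counter

def train_interp_bigram(sentences, vocab_size, alpha=0.1):
    """Interpolated bigram model:
    P(w|v) = (1-a) P_ML(w|v) + a P(w),  P(w) = (1-a) P_ML(w) + a / |V|."""
    bigrams, ctx, unigrams = Counter(), Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for v, w in zip(padded, padded[1:]):
            bigrams[(v, w)] += 1
            ctx[v] += 1
            unigrams[w] += 1
    total = sum(unigrams.values())

    def p_uni(w):
        ml = unigrams[w] / total if total else 0.0
        return (1 - alpha) * ml + alpha / vocab_size  # floor of a/|V|

    def p_bi(w, v):
        ml = bigrams[(v, w)] / ctx[v] if ctx[v] else 0.0
        return (1 - alpha) * ml + alpha * p_uni(w)

    return p_bi, p_uni
```

Every word, seen or unseen, now gets a probability of at least α²/|V|, so no sentence scores exactly zero.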

Idea of Witten-Bell method

Table: comparison of two n-gram contexts

president was                  president ronald
-------------                  ----------------
elected    5                   reagan      38
the        3                   caza         1
in         3                   venetiaan    1
first      3
· · ·
52 unique words, 110 times     3 unique words, 40 times

▶ “president was” may be followed by unknown words
  −→ We cannot trust P(· | president was)
▶ “president ronald” is frequently followed by “reagan”
  −→ We can trust P(· | president ronald)

Witten-Bell method

α depends on the reliability of each n-gram context:

α_{e_{i−n+1}^{i−1}} = u(e_{i−n+1}^{i−1}, ·) / (u(e_{i−n+1}^{i−1}, ·) + c(e_{i−n+1}^{i−1}))

P(e_i | e_{i−n+1}^{i−1}) = (1 − α_{e_{i−n+1}^{i−1}}) P_ML(e_i | e_{i−n+1}^{i−1}) + α_{e_{i−n+1}^{i−1}} P(e_i | e_{i−n+2}^{i−1})

e.g.

α_{president was} = u(president was, ·) / (u(president was, ·) + c(president was)) = 52 / (52 + 110)

α_{president was} = 0.32 −→ do not trust (see shorter contexts)
α_{president ronald} = 0.07 −→ trust
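The Witten-Bell α can be computed directly from a table of continuation counts; a small sketch using the numbers from the table above (the function name is mine):

```python
from collections import Counter

def witten_bell_alpha(followers):
    """alpha = u / (u + c), where u is the number of unique words seen
    after the context and c is the total count of the context."""
    u = len(followers)
    c = sum(followers.values())
    return u / (u + c)
```

For "president ronald" (3 unique followers, 40 occurrences) this gives 3/43 ≈ 0.07, and for "president was" (52 unique, 110 occurrences) 52/162 ≈ 0.32, matching the slide: the more varied the continuations, the more mass is backed off to shorter contexts.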

Absolute discounting method

Discount a constant parameter d from the frequencies of n-grams:

P_d(e_i | e_{i−n+1}^{i−1}) = max(c_train(e_{i−n+1}^i) − d, 0) / c_train(e_{i−n+1}^{i−1})

This ignores rare n-grams

(plot in the original: ratio of discounted to raw counts for d = 0.1, 0.5, 1.0, 2.0, over counts 1-10)

Absolute discounting method

Give the discounted quantity to shorter n-grams:

α_{e_{i−n+1}^{i−1}} = 1 − Σ_{e_i} P_d(e_i | e_{i−n+1}^{i−1})

P(e_i | e_{i−n+1}^{i−1}) = P_d(e_i | e_{i−n+1}^{i−1}) + α_{e_{i−n+1}^{i−1}} P(e_i | e_{i−n+2}^{i−1})

e.g. d := 0.5 (normally chosen to maximize the likelihood of a dev set):

P_d(reagan | president ronald) = (38 − 0.5) / 40 = 0.9375
P_d(caza | president ronald) = (1 − 0.5) / 40 = 0.0125
P_d(venetiaan | president ronald) = (1 − 0.5) / 40 = 0.0125

α_{president ronald} = 1 − Σ_{e_i} P_d(e_i | president ronald) = 0.0375
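The discounted distribution and its leftover mass α can be computed together; a sketch reproducing the example numbers above (function name is mine):

```python
from collections import Counter

def discounted_probs(followers, d=0.5):
    """P_d(w | ctx) = max(c(ctx w) - d, 0) / c(ctx).
    Returns the discounted distribution and alpha, the probability mass
    handed down to the shorter context."""
    c_ctx = sum(followers.values())
    probs = {w: max(c - d, 0.0) / c_ctx for w, c in followers.items()}
    alpha = 1.0 - sum(probs.values())
    return probs, alpha
```

With the "president ronald" counts from the table, each of the 3 observed followers gives up d/40 of probability, so α = 3 × 0.5/40 = 0.0375.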

Kneser-Ney method

Idea: “ronald reagan” or “president reagan” is frequently contained in corpora, so normal smoothing methods give a large probability to “ronald” and “reagan”. However, “reagan” is not used in other contexts.

Kneser and Ney used the unique counter u in absolute discounting:

P_kn(e_i | e_{i−n+1}^{i−1}) = max(u(·, e_{i−n+1}^i) − d, 0) / u(·, e_{i−n+1}^{i−1}, ·)

α_{e_{i−n+1}^{i−1}} = 1 − Σ_{e_i} P_kn(e_i | e_{i−n+1}^{i−1})

Kneser-Ney method (example)

u(·, reagan) = 2    u(·, ronald reagan) = 10    u(·, ronald smith) = 1
u(·, ronald, ·) = 11    u(·, ·) = 2000    d = 0.5

P_kn(reagan | ronald) = max(u(·, ronald reagan) − d, 0) / u(·, ronald, ·) = 0.864
P_kn(smith | ronald) = max(u(·, ronald smith) − d, 0) / u(·, ronald, ·) = 0.045
P_kn(reagan) = max(u(·, reagan) − d, 0) / u(·, ·) = 0.00075

α_ronald = 1 − Σ_{e_i} P_kn(e_i | ronald) = 0.091
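The example above is easy to check numerically; the unique counts are copied from the slide, not computed from a corpus:

```python
def p_kn(u_hist, u_ctx, d=0.5):
    """P_kn = max(u(., history) - d, 0) / u(., context, .),
    i.e. absolute discounting applied to unique counts instead of raw counts."""
    return max(u_hist - d, 0.0) / u_ctx

p_reagan = p_kn(10, 11)       # u(., ronald reagan) = 10, u(., ronald, .) = 11
p_smith = p_kn(1, 11)         # u(., ronald smith) = 1
p_uni_reagan = p_kn(2, 2000)  # u(., reagan) = 2, u(., .) = 2000
alpha_ronald = 1.0 - (p_reagan + p_smith)
```

Note how small P_kn(reagan) is despite "reagan" being frequent: it appears after only 2 distinct words, which is exactly the effect the Kneser-Ney idea aims for.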

Other methods

Additive smoothing:

P_d(e_i | e_{i−n+1}^{i−1}) = (c_train(e_{i−n+1}^i) + δ) / (c_train(e_{i−n+1}^{i−1}) + δ|W|)

where |W| is the number of words (for normalization)
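Additive smoothing is a one-liner; a sketch with my own parameter names:

```python
def additive_prob(c_ngram, c_ctx, n_words, delta=0.5):
    """(c(ctx w) + delta) / (c(ctx) + delta * |W|):
    every possible n-gram receives a pseudo-count of delta."""
    return (c_ngram + delta) / (c_ctx + delta * n_words)
```

When nothing has been seen at all, the estimate falls back to the uniform distribution 1/|W|.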

Other methods

Good-Turing (“Good” is a scientist’s name)

The Turing estimator uses revised values in place of raw counts:

r* = (r + 1) N_{r+1} / N_r

where N_r is the number of words occurring r times

If N_r = 0, r* becomes an indeterminate form

The Good-Turing estimator uses linear regression with Zipf’s law to solve this problem:

Z_{r_i} := 2 N_{r_i} / (r_{i+1} − r_{i−1})

where r_i is the i-th count with non-zero N_r (r_1 < r_2 < r_3 < · · ·)

Other methods

Good-Turing (cont’d)

Estimate a and b by fitting:

log Z_{r_i} ∼ a + b log r_i

Then:

r* = (r + 1) Z_{r+1} / Z_r = r (1 + 1/r)^{b+1}
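The two Good-Turing slides can be combined into one routine; a sketch of the recipe (my own handling of the first and last r_i, which have only one neighbor, is an assumption):

```python
import math

def good_turing_rstar(N):
    """Good-Turing with the log-linear fit from the slides:
    fit log Z_r ~ a + b log r, then return r* = r (1 + 1/r)**(b + 1).
    N maps a count r to N_r, the number of words seen r times."""
    rs = sorted(N)
    Z = {}
    for i, r in enumerate(rs):
        lo = rs[i - 1] if i > 0 else 0
        hi = rs[i + 1] if i + 1 < len(rs) else 2 * r - lo
        Z[r] = 2.0 * N[r] / (hi - lo)   # Z_{r_i} = 2 N_{r_i} / (r_{i+1} - r_{i-1})
    # least-squares fit of log Z against log r to get the slope b
    xs = [math.log(r) for r in rs]
    ys = [math.log(Z[r]) for r in rs]
    n = len(rs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda r: r * (1.0 + 1.0 / r) ** (b + 1.0)
```

On counts that follow an exact power law N_r ∝ r^b the fit recovers b exactly, and r* = r (1 + 1/r)^{b+1} follows.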

Other methods

Back-off

Use shorter n-grams only when the longer n-gram is not contained in the training data

Absolute discounting with back-off:

P(e_i | e_{i−n+1}^{i−1}) =
  P_d(e_i | e_{i−n+1}^{i−1})                      if c(e_{i−n+1}^i) > 0
  β_{e_{i−n+1}^{i−1}} P(e_i | e_{i−n+2}^{i−1})    otherwise
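The back-off rule for n = 2 can be sketched as below; the tables `p_d` (discounted bigram probabilities), `beta` (per-context back-off weights), and `p_uni` (unigram probabilities) are assumed to be precomputed, and the names are mine:

```python
def backoff_bigram(w, v, p_d, beta, p_uni):
    """Back-off for n = 2: use the discounted bigram probability when the
    bigram was seen, otherwise scale the unigram probability by beta_v."""
    if (v, w) in p_d:
        return p_d[(v, w)]
    return beta.get(v, 1.0) * p_uni.get(w, 0.0)
```

Unlike interpolation, the shorter context is consulted only when the longer n-gram is missing, rather than being mixed in everywhere.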