This page reproduces the content of https://speakerdeck.com/sorami/tang-plus-2014-understanding-the-limiting-factors-of-topic-modelling-via-posterior-contraction-analysis

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.

"Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis"

31st International Conference on Machine Learning (ICML), Beijing, June 2014.

http://jmlr.org/proceedings/papers/v32/tang14.pdf

- Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei and Ming Zhang.

31st International Conference on Machine Learning (ICML), Beijing, June 2014.

Sorami Hisamoto

August 20, 2014. - Summary

1. Theoretical results to explain the convergence behavior of LDA.

๏ “How does posterior converge as data increases?”

๏ Limiting factors: number of documents, length of docs, number of topics, …

2. Empirical study to support the theory.

๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …

๏ Real data sets: Wikipedia, the New York Times, and Twitter.

3. Guidelines for the practical use of LDA.

๏ Number of docs, length of docs, number of topics

๏ Topic / document separation, Dirichlet parameters, …

2 - Topic Modeling, Posterior Contraction Analysis, Empirical Study

- What is topic modeling?

๏ Modeling the latent “topics” behind each piece of data.

๏ A lot of applications. Not limited to text.

๏ LDA: the basic topic model (next slide)

Topics (e.g. word distributions) and data (e.g. documents)

Figure from [Blei+ 2003]

5 - latent Dirichlet allocation (LDA) [Blei+ 2003]

๏ It assumes that “each document consists of multiple topics”.

๏ “Topic” is defined as a distribution over a fixed vocabulary.

Two-stage generation process for each document

1. Randomly choose a distribution over topics.

2. For each word in the document

a) Randomly choose a topic from the distribution over topics chosen in step 1.

b) Randomly choose a word from the corresponding topic.
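The two-stage process above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the slides or the paper: the function name `generate_corpus`, its parameters, and the small sizes in the demo call are all assumptions made for the example. Here `alpha` and `beta` are the symmetric Dirichlet parameters for topic proportions and for topic-word distributions.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a synthetic corpus from the LDA generative process (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the fixed vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Step 1: choose a distribution over topics for this document.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Step 2a: choose a topic for every word position.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # Step 2b: choose each word from its assigned topic.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs

# Hypothetical small settings, chosen only to keep the demo fast.
topics, docs = generate_corpus(n_docs=5, doc_len=20, n_topics=3,
                               vocab_size=50, alpha=1.0, beta=0.1)
```

Each document comes back as an array of word ids, so the same function can be reused to build synthetic test sets of any size.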

6 - Figure from [Blei 2011]

Topic: distribution over vocabulary

Step 1: Choose a distribution over topics

Step 2a: Choose a topic from the distribution

Step 2b: Choose a word from the topic

7 - Graphical model representation

Nodes: topic proportion, topic assignment, observed word, topic

Joint distribution of hidden and observed variables

Figures from [Blei 2011]
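The equation for the joint distribution was an image on the slide and is not in this transcript. As a hedged reconstruction in the notation commonly used in Blei's review articles (an assumption, not text from the slide), with topics β_k, topic proportions θ_d, topic assignments z_{d,n} and observed words w_{d,n}, the joint distribution factorizes as follows. Note that β here denotes the topics themselves, not the Dirichlet hyperparameter β used on later slides.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{k=1}^{K} p(\beta_k)
    \prod_{d=1}^{D} p(\theta_d)
    \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\,
           p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
```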

8 - Geometric interpretation

Topic: a point in the word simplex

Step 1: Choose a distribution over topics (θ)

Step 2a: Choose a topic from the distribution (Z)

Step 2b: Choose a word from the topic (W)

LDA: finding the optimal sub-simplex to represent documents.

Figure from [Blei+ 2003]

9 - “reverse” the generation process

๏ We are interested in the posterior distribution:

๏ the latent topic structure, given the observed documents.

๏ But it is difficult to compute exactly → approximate:

๏ 1. Sampling-based methods (e.g. Gibbs sampling)

๏ 2. Variational methods (e.g. variational Bayes)

๏ etc.
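For concreteness, the posterior in question can be written as below. This is a standard way to state it, given here as an assumption about what the slide displayed (with topics β, topic proportions θ, topic assignments z, and observed words w). The denominator, the marginal probability of the observed words, is the part that cannot be computed exactly, which is why the approximation methods listed above are needed.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D})
  = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}
```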

10 - FAQs on LDA

๏ Is my data topic-model “friendly”?

๏ Why did LDA fail on my data?

๏ How many documents do I need to learn 100 topics?

๏ Machine learning folklore …

11 - Topic Modeling, Posterior Contraction Analysis, Empirical Study

- Convergence behavior of the posterior

๏ How does the convergence behavior of the posterior change as data increases?

๏ → Introduce a metric that describes a contracting neighborhood centred at the true topic values, on which the posterior distribution will be shown to place most of its probability mass.

๏ The faster the contraction, the more efficient the statistical inference.

14 - … but it’s difficult to see individual topics

๏ Issue of identifiability

๏ “Label-switching” issue: one can only identify the topic collection up to a permutation.

๏ Any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze.

15 - Latent topic polytope in the LDA

Topic Polytope: convex hull of the topics

topics

Distance between two polytopes: “minimum-matching” Euclidean

* Intuitively, this metric is a stable measure of the dissimilarity between two topic polytopes.

Figures from [Tang+ 2014]
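A minimal sketch of a “minimum-matching” Euclidean distance between two sets of topic vectors follows. It assumes the usual definition in which each vertex is matched to its nearest vertex in the other polytope and the largest such gap, taken in both directions, is reported; the exact formula was an image on the slide, so treat this form and the function name `min_matching_distance` as assumptions and check [Tang+ 2014] or [Nguyen 2012] for the precise statement.

```python
import numpy as np

def min_matching_distance(topics_a, topics_b):
    """Minimum-matching Euclidean distance between two vertex sets (assumed form).

    Rows are topic vectors, treated as the extreme points of each polytope.
    """
    a = np.asarray(topics_a, dtype=float)
    b = np.asarray(topics_b, dtype=float)
    # pairwise[i, j] = Euclidean distance between a[i] and b[j].
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest gap from a vertex of A to its nearest vertex of B, and vice versa.
    a_to_b = pairwise.min(axis=1).max()
    b_to_a = pairwise.min(axis=0).max()
    return max(a_to_b, b_to_a)

true_topics = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
# The same topics in a different order: relabeling does not change the distance.
permuted = true_topics[[2, 0, 1]]
same_up_to_order = min_matching_distance(true_topics, permuted)
```

Because the distance only looks at nearest vertices, it is unaffected by the label-switching issue, and it also works when the two sets contain different numbers of topics (the over-fitted case).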

16 - Geometric interpretation

Topic polytope

Figure from [Blei+ 2003]

17 - Upper bound for the learning rate

G*: true topic polytope

K*: true number of topics

D: number of documents

N: length of documents

Figures from [Tang+ 2014]
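The theorem itself was shown as an image and is missing from this transcript. Judging from the notes elsewhere in the deck (an upper bound of order (log D/D)^0.5, three terms, dependence on both 1/D and 1/N), the contraction rate appears to have the form below. This is a reconstruction under those assumptions, with d_M the minimum-matching distance, G the fitted topic polytope and C a constant; consult [Tang+ 2014] for the exact statement and conditions.

```latex
\Pi\bigl( d_{\mathcal{M}}(G, G^{*}) \le C\,\delta_{D,N} \,\big|\, \text{data} \bigr) \to 1,
\qquad
\delta_{D,N} = \left( \frac{\log D}{D} + \frac{\log N}{N} + \frac{\log D}{N} \right)^{1/2}
```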

18 - Observations from Theorem 1

๏ From (3), we should have log D < N (the length of documents should be at least on the order of log D, up to a constant factor).

๏ In the empirical study, the last term of (5) does not appear to play a noticeable role → the 3rd term may be an artefact of the proof technique?

๏ In practice the actual rate could be faster than the given upper bound. However, this looseness of the upper bound only occurs in the exponent; the dependence on 1/D and 1/N should remain due to a lower bound → Sec. 3.1.4 & [Nguyen 2012]

๏ Condition A2: well-separated topics → small β.

๏ The convergence rate does not depend on the number of topics K → once K is known, or the topics are well-separated, LDA inference is statistically efficient.

๏ In practice we do not know K*: since under-fitting results in a persistent error even with an infinite amount of data, we are most likely to prefer the over-fitted setting (K>>K*).
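To get a feel for how such a bound behaves, the sketch below evaluates a rate of the form sqrt(log D/D + log N/N + log D/N). That three-term form is an assumption reconstructed from the notes above (the formula labelled (5) is not in the transcript), and `assumed_rate` is a hypothetical helper, so the numbers are only indicative.

```python
import math

def assumed_rate(n_docs, doc_len):
    """Assumed upper-bound rate: sqrt(log D / D + log N / N + log D / N)."""
    d, n = n_docs, doc_len
    return math.sqrt(math.log(d) / d + math.log(n) / n + math.log(d) / n)

# Fixed document length N = 500, growing number of documents D.
for d in (10, 100, 1_000, 10_000, 100_000):
    print(d, round(assumed_rate(d, 500), 4))
```

Under this assumed form, the value drops quickly while D is small and then levels off, because the terms that depend on N stop shrinking. Increasing N as well lowers the plateau.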

19 - Theorem for general situations

When neither condition A1 nor A2 in Theorem 1 holds:

The upper bound deteriorates with K.

cf. [Nguyen 2012] for more detail.

20 - Topic Modeling, Posterior Contraction Analysis, Empirical Study

- Empirical study: metrics

Distance between two polytopes: “minimum-matching” Euclidean

When the number of vertices of a polytope in general position is smaller than the number of dimensions, all such vertices are also extreme points of their convex hull.

23 - Experiments on synthetic data

๏ Create synthetic data set by LDA generative process.

๏ Default settings:

๏ true number of topics K*: 3

๏ vocabulary size |V|: 5,000

๏ symmetric Dirichlet prior for topic proportions: α = 1

๏ symmetric Dirichlet prior for word distributions: β = 0.01

๏ Model inference: collapsed Gibbs sampling.

๏ Learning error: posterior mean of the metric.

๏ Reported results: averaged over 30 simulations.
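For readers unfamiliar with the inference step, here is a compact sketch of collapsed Gibbs sampling for LDA. It is a textbook-style illustration, not the authors' implementation: the function name, the tiny hand-made corpus and the iteration count are assumptions, and it returns a point estimate from the final sample rather than the posterior mean of the metric used in the experiments.

```python
import numpy as np

def collapsed_gibbs(docs, n_topics, vocab_size, alpha, beta, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns estimated topic-word distributions."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # tokens in doc d assigned to topic k
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(n_topics)               # tokens assigned to topic k overall
    # Random initial topic assignment for every token.
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                           / (topic_total + vocab_size * beta))
                k = rng.choice(n_topics, p=weights / weights.sum())
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1
    # Point estimate of each topic from the final sample.
    return (topic_word + beta) / (topic_total[:, None] + vocab_size * beta)

# Tiny hand-made corpus: word ids 0-2 and 3-5 never co-occur.
docs = [np.array([0, 1, 2, 0, 1, 2]), np.array([3, 4, 5, 3, 4, 5]),
        np.array([0, 0, 1, 2, 2, 1]), np.array([5, 4, 3, 3, 5, 4])]
est_topics = collapsed_gibbs(docs, n_topics=2, vocab_size=6, alpha=1.0, beta=0.01)
```

Each row of `est_topics` is a distribution over the vocabulary, so it can be compared with the true topics using a polytope distance.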

24 - Scenario I: fixed N and increasing D

๏ N=500

๏ D=10~7,000

๏ K

๏ =3=K*: exact-fitted

๏ =10: over-fitted

๏ β

๏ =0.01: well-separated topics

๏ =1: more word-diffuse, less distinguishable topics

Main varying term (compared in graphs)

Figure panels: β = 0.01 and β = 1, each comparing K = K* with K > K*

1. Same β but different K: when LDA is over-fitted (i.e. K > K*), the performance degenerates significantly.

2. Same K but different β: when β is larger, the error curves decay faster when less data is available. As more data becomes available, the decay becomes slower, then flattens out. By contrast, small β results in a more efficient learning rate.

3. K=K*: the error rate seems to match (log D/D)^0.5 quite well. In the over-fitted case, the rate is slower.

25 - Scenario II: fixed D and increasing N

๏ N=10~1,400

๏ D=1,000

๏ K

๏ =3=K*: exact-fitted

๏ =5: over-fitted

๏ β

๏ =0.01: well-separated topics

๏ =1: more word-diffuse, less distinguishable topics

Main varying term (compared in graphs)

Behavior is similar to Scenario I.

In over-fitted cases (K>K*), the error fails to vanish even as N becomes large. This is possibly due to the log D / D term in the upper bound.

26 - Scenario III: N=D, both increasing

๏ N=D: 10~1,300

๏ K={3, 5}

๏ β={0.01, 1}

Similar to previous scenarios, LDA is most effective in the exact-fitted setting (K=K*) and when topics are sparse (β small).

When both conditions fail (K > K*, β = 1), the error rate fails to converge to zero, even if the data size D=N increases.

Empirical error decays at a faster rate than indicated by the upper bound (log D/D)^0.5 from Thm. 1.

A rough estimate could be Ω(1/D), which actually matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in [Nguyen 2012]).

This suggests that the upper bound given in Thm. 1 could be quite conservative in certain configurations and scenarios.

27 - Exponential exponents of the error rate

๏ 2 scenarios:

๏ Fixed N=5 and increasing D.

๏ D=N and both increasing.

Figure panels: K = K* and K > K* for each scenario

Exact-fitted (K=K*): the slope of the log error seems close to 1 → matches the lower bound Ω(1/D).

Over-fitted (K>K*): the slopes tend toward the range bounded by 1/(2K) = 0.1 and 2/K = 0.4 → approximations of the exponents of the lower/upper bounds given by theory.
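The "slope of the log error" can be estimated with a straight-line fit on a log-log scale. The sketch below shows the idea on a made-up error curve; the function name `fitted_exponent` and the data are hypothetical and are not the values plotted in the paper.

```python
import numpy as np

def fitted_exponent(data_sizes, errors):
    """Slope of log(error) against log(data size), negated so decay is positive."""
    slope, _intercept = np.polyfit(np.log(data_sizes), np.log(errors), deg=1)
    return -slope

# Hypothetical error curve that decays exactly like 1/D.
sizes = np.array([10, 100, 1_000, 10_000])
errors = 3.0 / sizes
exponent = fitted_exponent(sizes, errors)
```

An exponent near 1 corresponds to an error of order 1/D, while exponents between 0.1 and 0.4 would correspond to the much slower decay described for the over-fitted case with K = 5.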

28 - Experiments on real data sets

๏ Wikipedia, the New York Times articles, and Twitter.

๏ To test the effects of the four limiting factors: N, D, α, β.

๏ Ground-truth topics unknown → use PMI or perplexity.
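As one possible reading of the PMI criterion, the sketch below scores a topic by the average pointwise mutual information of its top words, with probabilities estimated from document co-occurrence in a reference corpus. The slides do not say how PMI was computed, so the estimator, the smoothing constant `eps`, and the toy reference corpus are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pmi_score(top_words, reference_docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        # eps avoids log(0) for pairs that never co-occur.
        scores.append(math.log((joint + eps) / (prob(w1) * prob(w2) + eps)))
    return sum(scores) / len(scores)

# Toy reference corpus (hypothetical).
reference = [["game", "team", "score"], ["team", "coach", "game"],
             ["stock", "market", "price"], ["market", "price", "trade"]]
coherent = pmi_score(["game", "team"], reference)
mixed = pmi_score(["game", "market"], reference)
```

Words that tend to appear in the same documents give a positive score, while a topic mixing unrelated words scores strongly negative.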

29 - Figure grid. Columns: fixed D, increasing N; fixed N, increasing D; fixed N&D, varying α; fixed N&D, varying β. Rows: Wikipedia, New York Times, Twitter.

Results are consistent with the theory and the empirical analysis on synthetic data.

With extreme data (e.g. very short or very few documents), or when hyperparameters are not appropriately set, performance suffers.

Results suggest favorable ranges of parameters: small β, small α (Wikipedia) or large α (NYT, Twitter).

30 - Implications and guidelines: 1 & 2

1. Number of documents: D

๏ It is impossible to guarantee identification of topics from a small D, no matter how long the documents are.

๏ Once D is sufficiently large, further increase may not significantly improve the result, unless N is also suitably increased.

๏ In practice, LDA achieves comparable results even if only thousands of documents are sampled from a much larger collection.

2. Length of document: N

๏ Poor results are expected when N is small, even if D is large.

๏ Ideally, N needs to be sufficiently large, but need not be too large.

๏ In practice, for very long documents, one can sample a fraction of each document and LDA still yields comparable topics.

31 - Implications and guidelines: 3, 4, & 5

3. Number of topics: K

๏ If K > K*, inference may become inefficient.

๏ In theory, the convergence rate deteriorates quickly to a nonparametric rate, depending on the number of topics used to fit LDA → be careful not to use too large a K.

4. Topic / document separation: LDA performs well when …

๏ Topics are well-separated.

๏ Individual documents are associated mostly with a small subset of topics.

5. Hyperparameters

๏ If you think each document is associated with few topics, set α small (e.g. 0.1).

๏ If the topics are known to be word-sparse, set β small (e.g. 0.01) → more efficient learning.
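The effect of α on how many topics a document uses can be checked directly by sampling from a symmetric Dirichlet. This is a small illustrative experiment, not from the paper: the helper name `mean_active_topics`, the 20 topics, and the 5% threshold are arbitrary assumptions.

```python
import numpy as np

def mean_active_topics(alpha, n_topics=20, n_docs=2_000, threshold=0.05, seed=0):
    """Average number of topics holding more than `threshold` of a document's mass."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    return (theta > threshold).sum(axis=1).mean()

sparse = mean_active_topics(alpha=0.1)  # each document concentrates on few topics
dense = mean_active_topics(alpha=1.0)   # mass is spread over many more topics
```

A smaller α yields fewer active topics per document, which matches the advice to set α small when documents are believed to cover only a few topics. The same reasoning applies to β and the number of words each topic uses.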

32 - Limitations of existing results

1. Geometrically intuitive assumptions

๏ e.g. in reality we don’t know how well separated the topics are, or whether their convex hull is geometrically degenerate.

๏ → It may be beneficial to impose additional geometric constraints on the prior.

2. True / approximated posterior

๏ Here we considered the true posterior distribution.

๏ In practice, the posterior is obtained by approximation techniques → additional error.

33 - To summarize …

1. Theoretical results to explain the convergence behavior of LDA.

๏ “How does posterior converge as data increases?”

๏ Limiting factors: number of documents, length of docs, number of topics, …

2. Empirical study to support the theory.

๏ Synthetic data: various settings e.g. number of docs / topics, length of docs, …

๏ Real data sets: Wikipedia, the New York Times, and Twitter.

3. Guidelines for the practical use of LDA.

๏ Number of docs, length of docs, number of topics

๏ Topic / document separation, Dirichlet parameters, …

34 - Some references (1)

๏ [Blei&Lafferty 2009] Topic Models

http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

๏ [Blei 2011] Introduction to Probabilistic Topic Models

https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

๏ [Blei 2012] Review Articles: Probabilistic Topic Models

Communications of The ACM

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

๏ [Blei 2012] Probabilistic Topic Models

Machine Learning Summer School

http://www.cs.princeton.edu/~blei/blei-mlss-2012.pdf

๏ Topic Models by David Blei (video)

https://www.youtube.com/watch?v=DDq3OVp9dNA

๏ What is a good explanation of Latent Dirichlet Allocation? - Quora

http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation

๏ The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors by Matthew L. Jockers

http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/

๏ [Mochihashi & Ishiguro 2013] Probabilistic Topic Models (確率的トピックモデル, in Japanese)

The Institute of Statistical Mathematics, FY2012 open lecture

http://www.ism.ac.jp/~daichi/lectures/ISM-2012-TopicModels-daichi.pdf

35 - Some references (2)

๏ [Blei+ 2003] Latent Dirichlet Allocation

Journal of Machine Learning Research

http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf

๏ [Nguyen 2012] Posterior contraction of the population polytope in finite admixture models

arXiv preprint arXiv:1206.0068

http://arxiv.org/abs/1206.0068

๏ [Tang+ 2014] Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

Proceedings of the 31st International Conference on Machine Learning (ICML)

http://jmlr.org/proceedings/papers/v32/tang14.pdf

36