Efficient Methods for Incorporating Knowledge into Topic Models [Yang, Downey and Boyd-Graber 2015]
2015/10/24 EMNLP 2015 Reading @shuyo
Large-scale Topic Model
• In academic papers
  – Up to 10^3 topics
• Industrial applications
  – 10^5~10^6 topics! really?
  – Search engines, online ads, and so on
  – To capture infrequent topics
• This paper handles up to 500 topics...
(Standard) LDA [Blei+ 2003, Griffiths+ 2004]
• "Conventional" Gibbs sampling
  – $P(z = t \mid \boldsymbol{z}_-, w) \propto q_t := (n_{d,t} + \alpha)\,\dfrac{n_{w,t} + \beta}{n_t + V\beta}$
  – $T$ : number of topics
  – For $U \sim \mathcal{U}\!\left(0, \sum_{z=1}^{T} q_z\right)$, find $t$ s.t. $\sum_{z=1}^{t-1} q_z < U < \sum_{z=1}^{t} q_z$
• For large $T$, this is computationally intensive (see the sketch below)
  – $n_{w,t}$ is sparse
  – When $T$ is very large, $n_{d,t}$ is too, e.g. $T = 10^6 > n_d$
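A minimal sketch of this conventional per-token draw in Python/NumPy; the count arrays `n_dt`, `n_wt`, `n_t` and the generator `rng` are illustrative names, not from any released implementation. It makes the O(T) cost explicit: every topic's q_t must be computed and scanned for every token.

```python
import numpy as np

def sample_topic_naive(n_dt, n_wt, n_t, alpha, beta, V, rng):
    # q_t = (n_{d,t} + alpha) * (n_{w,t} + beta) / (n_t + V*beta), for every topic t
    q = (n_dt + alpha) * (n_wt + beta) / (n_t + V * beta)   # O(T) work per token
    u = rng.uniform(0.0, q.sum())                            # U ~ Uniform(0, sum_t q_t)
    # linear scan: return the first t whose cumulative mass exceeds U
    cumulative = 0.0
    for t, q_t in enumerate(q):
        cumulative += q_t
        if u < cumulative:
            return t
    return len(q) - 1  # numerical fallback
```

SparseLDA [Yao+ 2009] avoids this full scan by decomposing q_t into buckets that are sparse in n_{d,t} and n_{w,t}; SC-LDA builds on the same decomposition (see the sampling equation a few slides below).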
Leveraging Prior Knowledge
• The objective function of topic models does not correlate with human judgements
Word correlation prior knowledge
• Must-link
  – “quarterback” and “fumble” are both related to American football
• Cannot-link
  – “fumble” and “bank” imply two different topics
SC-LDA [Yang+ 2015] (Sparse Constrained)
• $m \in M$ : prior knowledge
• $f_m(z, w, d)$ : potential function of prior knowledge $m$ about word $w$ with topic $z$ in document $d$
  – maybe over all $m \in M$, all $w$ with $z$, in all $d$
• $\psi(\boldsymbol{z}, M) = \prod_{z \in \boldsymbol{z}} \exp f_m(z, w, d)$
• $P(\boldsymbol{w}, \boldsymbol{z} \mid \alpha, \beta, M) = P(\boldsymbol{w} \mid \boldsymbol{z}, \beta)\, P(\boldsymbol{z} \mid \alpha)\, \psi(\boldsymbol{z}, M)$   maybe $\propto$
Inference for SC-LDA
Word correlation prior knowledge for SC-LDA
• $f_m(z, w, d) = \sum_{u \in M_w^m} \log \max(\lambda, n_{u,z}) + \sum_{v \in M_w^c} \log \max(\lambda, 1/n_{v,z})$
  – where $M_w^m$ : must-links of $w$, $M_w^c$ : cannot-links of $w$
• $P(z = t \mid \boldsymbol{z}_-, w, M) \propto \left[ \dfrac{\alpha\beta}{n_t + V\beta} + \dfrac{n_{d,t}\,\beta}{n_t + V\beta} + \dfrac{(n_{d,t} + \alpha)\, n_{w,t}}{n_t + V\beta} \right] \prod_{u \in M_w^m} \max(\lambda, n_{u,z}) \prod_{v \in M_w^c} \max(\lambda, 1/n_{v,z})$
  – (see the sketch below)
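A NumPy sketch of evaluating this unnormalized distribution for one token under the word-correlation potential. The array names (`n_dt`, `must_links_nz`, ...) and the dense formulation are illustrative assumptions; the actual SC-LDA implementation exploits the sparsity of the document and word buckets, which this sketch does not.

```python
import numpy as np

def sc_lda_topic_scores(n_dt, n_wt, n_t, alpha, beta, V, lam,
                        must_links_nz, cannot_links_nz):
    """Unnormalized P(z = t | ...) for one token, following the slide's equation.

    must_links_nz / cannot_links_nz: iterables of length-T arrays holding n_{u,z}
    for the must-linked / cannot-linked words of w (illustrative names)."""
    denom = n_t + V * beta
    smoothing = alpha * beta / denom              # constant bucket
    doc_bucket = n_dt * beta / denom              # nonzero only where n_{d,t} > 0
    word_bucket = (n_dt + alpha) * n_wt / denom   # nonzero only where n_{w,t} > 0
    base = smoothing + doc_bucket + word_bucket

    knowledge = np.ones_like(base)
    for n_uz in must_links_nz:                    # prod_u max(lambda, n_{u,z})
        knowledge *= np.maximum(lam, n_uz)
    for n_vz in cannot_links_nz:                  # prod_v max(lambda, 1 / n_{v,z})
        # guarding n_{v,z} = 0 with max(., 1) is my assumption to avoid division by zero
        knowledge *= np.maximum(lam, 1.0 / np.maximum(n_vz, 1))
    return base * knowledge
```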
Factor Graph
• They say that prior knowledge is incorporated “by adding a factor graph to encode prior knowledge,” but the factor graph is never actually drawn.
• The potential function $f_m(z, w, d)$ contains $n_{w,z}$, and $\varphi_{w,z} \propto n_{w,z} + \beta$.
• So the model seems to look like Fig. b:
  (Fig. a / Fig. b: graphical structures; figures not shown)
[Ramage+ 2009] Labeled LDA
• Supervised LDA for labeled documents
  – It is equivalent to SC-LDA with the following potential function (see the sketch below):
    $f_m(z, w, d) = \begin{cases} 1, & \text{if } z \in m_d \\ -\infty, & \text{otherwise} \end{cases}$
    where $m_d$ specifies the label set of document $d$
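A tiny sketch of how this potential acts inside the sampler: exp(f) becomes a hard mask over topics, so only topics in the document's label set can ever be sampled. The function and variable names are illustrative.

```python
import math

def labeled_lda_potential(z, doc_labels):
    # 1 if topic z is in the document's label set m_d, -inf otherwise
    return 1.0 if z in doc_labels else -math.inf

# exp(f) masks the sampling distribution: e for allowed topics, 0 for the rest
weights = [math.exp(labeled_lda_potential(z, {0, 3})) for z in range(5)]
# -> [2.718..., 0.0, 0.0, 2.718..., 0.0]
```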
Experiments
• Baselines
  – Dirichlet Forest-LDA [Andrzejewski+ 2009]
  – Logic-LDA [Andrzejewski+ 2011]
  – MRF-LDA [Xie+ 2015]
    • Encodes word correlations in LDA as an MRF
  – SparseLDA [Yao+ 2009]
• Datasets

  DATASET    DOCS       TYPES    TOKENS (approx.)  EXPERIMENT
  NIPS       1,500      12,419   1,900,000         Word correlation
  NYT-NEWS   3,000,000  102,660  100,000,000       Word correlation
  20NG       18,828     21,514   1,946,000         Labeled docs
Generate Word Correlation
• Must-link
  – Obtain synsets from WordNet 3.0
  – Keep a pair only if the word2vec embedding similarity between the word and a word in its synset is higher than a threshold of 0.2 (a minimal sketch follows below)
• Cannot-link
  – Nothing?
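A sketch of this must-link generation, assuming NLTK's WordNet interface and a pre-trained word2vec model loaded via gensim; the model file name, helper name, and lower-casing are my assumptions, not taken from the paper.

```python
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)  # path is illustrative
THRESHOLD = 0.2  # similarity cutoff from the slide


def must_links_for(word):
    """Pair `word` with WordNet synonyms whose embedding similarity exceeds the threshold."""
    links = set()
    if word not in w2v:
        return links
    for synset in wn.synsets(word):
        for candidate in synset.lemma_names():
            candidate = candidate.lower()
            if (candidate != word and candidate in w2v
                    and w2v.similarity(word, candidate) > THRESHOLD):
                links.add((word, candidate))
    return links
```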
Convergence Speed
• (Figure) The average running time per iteration over 100 iterations, averaged over 5 seeds, on the 20NG dataset.
Coherence [Mimno+ 2011]
• $C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \dfrac{F(v_m^{(t)}, v_l^{(t)}) + \epsilon}{F(v_l^{(t)})}$
  – $\epsilon$ is very small, like $10^{-12}$ [Röder+ 2015]
  – $F(v)$ : document frequency of word type $v$
  – $F(v, v')$ : co-document frequency of word types $v, v'$ — it means “include”, i.e. documents that include both? (a sketch of the computation follows below)
• (Results figure: coherence values shown include −39.1 and −36.6)
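A small sketch of this coherence score, assuming `documents` are given as sets of word types and `top_words` are one topic's M most probable words; helper names are mine, not the paper's.

```python
import math
from collections import Counter
from itertools import combinations

def topic_coherence(top_words, documents, eps=1e-12):
    """Mimno+ 2011 coherence of one topic, with the epsilon smoothing noted on the slide."""
    df = Counter()       # F(v): number of documents containing v
    co_df = Counter()    # F(v, v'): number of documents containing both v and v'
    for doc in documents:
        present = doc & set(top_words)
        for v in present:
            df[v] += 1
        for v, v2 in combinations(sorted(present), 2):
            co_df[frozenset((v, v2))] += 1

    score = 0.0
    for m in range(1, len(top_words)):      # m = 2..M in the slide's 1-based notation
        for l in range(m):                  # l = 1..m-1
            v_m, v_l = top_words[m], top_words[l]
            score += math.log((co_df[frozenset((v_m, v_l))] + eps)
                              / max(df[v_l], 1))  # max(., 1) guard is my assumption
    return score
```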
References
• [Yang+ 2015] Efficient methods for incorporating knowledge into topic models.
• [Blei+ 2003] Latent Dirichlet allocation.
• [Griffiths+ 2004] Finding scientific topics.
• [Yao+ 2009] Efficient methods for topic model inference on streaming document collections.
• [Ramage+ 2009] Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora.
• [Andrzejewski+ 2009] Incorporating domain knowledge into topic modeling via Dirichlet forest priors.
• [Andrzejewski+ 2011] A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic.
• [Xie+ 2015] Incorporating word correlation knowledge into topic modeling.
• [Mimno+ 2011] Optimizing semantic coherence in topic models.
• [Röder+ 2015] Exploring the space of topic coherence measures.