This page reproduces the slides from http://www.slideshare.net/I_eric_Y/dsirnlp6-i-ericy .


- Self-introduction

• Interested in the structure behind observed data (modeling)
– Dynamic Bayesian Net, Gaussian Process, Latent Dirichlet Allocation, (Hierarchical) Dirichlet Process, Indian Buffet Process, Infinite Relational Model…
• Worked part-time at JX通信社 while a student
– Developed the automatic article-summarization feature of Vingow, an automatic article-collection app

2 - outline

• probabilistic modeling
• handling large-scale data
– subsampling
– model transformation
– parallelization / streaming
• summary

※ The figures and tables in the slides are quoted from the papers.

3 - probabilistic modeling

• Probabilistically represent the latent structure behind observed data
• Example: Latent Dirichlet Allocation [D. Blei et al., JMLR, 2003]

β_k ~ Dirichlet(η)
θ_m ~ Dirichlet(α)
z_n ~ Multinomial(θ_m)
w_n ~ Multinomial(β_k) | z_n = k

[Figure 7 of the paper: graphical model representation of the smoothed LDA model, with hyperparameters η and α, topics β, topic proportions θ, topic assignments z, and words w, over plates of N words and M documents.]

– The structure behind a document (bag-of-words) is represented with multinomial and Dirichlet distributions
– The latent variable z expresses which topic each word w belongs to
– Learning the latent variables and parameters enables interesting visualizations
– p(θ|X) ∝ p(X|θ) p(θ)

[Figure 8 of the paper: an example article from the AP corpus; each color codes a different factor from which the word is putatively generated.]
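The generative process above can be sketched directly with numpy's Dirichlet and categorical samplers. A minimal sketch — the topic, vocabulary, and corpus sizes and the hyperparameter values are illustrative, not from the slide or the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (not from the paper).
K, V, M, N = 3, 8, 4, 10     # topics, vocab size, documents, words per doc
eta, alpha = 0.5, 0.1

# beta_k ~ Dirichlet(eta): one word distribution per topic
beta = rng.dirichlet(np.full(V, eta), size=K)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))            # theta_m ~ Dirichlet(alpha)
    z = rng.choice(K, size=N, p=theta_m)                  # z_n ~ Multinomial(theta_m)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # w_n ~ Multinomial(beta_k) given z_n = k
    docs.append(w)

print(len(docs), docs[0])   # 4 documents, each a vector of 10 word ids
```

Inference (the interesting part, as the slide notes) runs this process in reverse: given only the word ids, recover θ, z, and β.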

4 - probabilistic modeling

• In research
– How well can the observed data be represented and predicted?
• Evaluate the model itself by marginal likelihood or perplexity
• In applications
– Use the latent variables to extract useful insights
• Visualization by topic, data mining
– Solve complex problems
• Example: nonlinear regression of multiple trajectories while clustering them, with manually added constraints on the clusters
– J. Ross et al., Nonparametric Mixture of Gaussian Processes with Constraints, JMLR 2013.
– Dimensionality reduction / feature extraction
– Use as a kernel, like the Fisher kernel
• Basically unsupervised learning
– No need to create labeled training data
– Well suited to large-scale data if it can be solved fast?

[Figure 2 of the paper: an example MRF over latent variables z1–z9 illustrating disconnected subgraphs; each graph edge represents either a must-link or cannot-link constraint.]

[Figures 4 and 5 of the paper: algorithm output for the unconstrained case, and learned regressors during training with and without constraints.]
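Perplexity, mentioned above as a model-evaluation criterion, is the exponentiated negative mean held-out log-likelihood per word. A toy sketch with made-up numbers:

```python
import numpy as np

# Perplexity of held-out words under a fitted word distribution:
# exp of the negative mean log-likelihood per word; lower is better.
# The distribution and the held-out ids below are made up for illustration.
probs = np.array([0.5, 0.3, 0.2])       # model's categorical word distribution
held_out = np.array([0, 0, 1, 2, 0])    # held-out word ids

log_lik = np.log(probs[held_out])
perplexity = np.exp(-log_lik.mean())
print(round(perplexity, 3))  # → 2.661
```

A uniform distribution over V words gives perplexity exactly V, which is the usual baseline for reading these numbers.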

• Multi-attribute data generated by Web services
– Amazon review data: rating, review text, product genre, and author information are observed simultaneously
• Complex models appear one after another
– One model per service seems fine to me
– Large numbers of diverse latent variables
– X. Chen et al., "Perspective Hierarchical Dirichlet Process for User-Tagged Image Modeling", CIKM 2011.

[Fig. 1 of the paper: graphical representation of the perspective HDP (pHDP) model for user-tagged images.]

• Estimating the latent variables and parameters takes longer and longer
• Large-scale data, complex models
• Traditional, general-purpose inference methods
– Markov Chain Monte Carlo: sample from the posterior distribution
– Variational Bayes: approximate the posterior with a variational posterior
– Used as-is, they cannot keep up with this scale
– This talk surveys research on coping strategies built mainly on these two methods
• Spectral learning, splash belief propagation, sequential Monte Carlo, etc. are not covered

6 - Towards scaling up MCMC: an adaptive subsampling approach

• R. Bardenet, A. Doucet, and C. Holmes, "Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach", ICML 2014.

[Subsampling]

– In MCMC, the likelihood evaluation inside the acceptance-probability computation (accept or reject a newly proposed parameter?) is expensive (all n data points)
– Reduce the number of data points by subsampling
– How much do we need to sample?
– How well can we approximate?
– Keep drawing subsamples until a certain probabilistic (concentration) bound is satisfied
– The error between the exact value and the subsampled value can be controlled (probabilistically)

[Figure 3 of the paper: pseudocode of the MH algorithm with subsampled likelihoods; Step 16 uses a Hoeffding bound, but other concentration inequalities (e.g. empirical Bernstein) are possible.]
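A rough sketch of the idea (not the paper's exact algorithm): the subsampled average log-likelihood ratio is compared against the MH threshold, and the subsample grows geometrically until a Hoeffding-style confidence bound separates the two. The toy Gaussian model, flat prior, symmetric proposal, and the crude range constant C computed from the full data are all simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def loglik(x, theta):
    # Toy model: x_i ~ N(theta, 1); pointwise log-likelihood up to a constant.
    return -0.5 * (x - theta) ** 2

def approx_mh_accept(x, theta, theta_new, delta=0.05, gamma=2.0):
    """Subsampled MH accept/reject decision (simplified sketch).

    With a flat prior and symmetric proposal, the exact MH test is
    mean(loglik(x, theta_new) - loglik(x, theta)) > log(u) / n.
    """
    n = len(x)
    psi = np.log(rng.uniform()) / n          # per-point MH threshold
    diffs = loglik(x, theta_new) - loglik(x, theta)
    C = np.ptp(diffs)                        # range for the Hoeffding bound (oracle here)
    perm = rng.permutation(n)                # subsample without replacement
    b = 32
    while True:
        sub = diffs[perm[:b]]
        lam, t = sub.mean(), len(sub)
        c_t = C * np.sqrt(np.log(3.0 / delta) / (2.0 * t))
        if abs(lam - psi) > c_t or b >= n:   # confident enough, or full data used
            return lam > psi
        b = min(n, int(np.ceil(gamma * b)))  # grow the subsample geometrically

# Tiny demo: random-walk MH for the mean of a Gaussian.
x = rng.normal(1.5, 1.0, size=5000)
theta = 0.0
for _ in range(200):
    prop = theta + 0.2 * rng.normal()
    if approx_mh_accept(x, theta, prop):
        theta = prop
print(theta)  # should settle near the sample mean, ~1.5
```

Far from the mode the decision is easy and only a small subsample is touched; near the mode the bound rarely separates and the loop falls back to the full data, matching the slide's point that the error is controlled only probabilistically.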

n - モデルの変形

• T. Nguyen and E. Bonil a, "Fast Al oca?on Fo

ast f FGau

Allocation

ast

sofsian

Gaussian

AllocationPPrr

of oc

ocess ess

Experts E

Gaussian xp

Pr erts

ocess ”,

Experts

ICML 2014.

0.08

4

5

our method

0.08

4

5

[計算量削減]

3

FITC

4

0.06

our method

kmeans

2

random

3

3

FITC

4

– Gaussian Processによる

0.06 非線形回帰（

0.04

SMSE

O(N^3)） NLPD 1

kmeans

2

– データをクラスタリングしながら

0.02

回帰

2

3

0

random

1

Training time (hours)

0.04

• データ点に潜在変数

SMSE

、0同じ潜在変数を

−1

NLPD

1

0

kin40k

pumadyn

pole

持つデータ

kin40k

だけで

pumadyn

GP

pole

kin40k

pumadyn

pole

2

– 1クラスタに1GPが対応

0.02 、クラスタ内のデ

(a) smse

ータだけでGP

0 回帰す

(b) nlpd るので計算量が減る

(c) training time 1

– クラスタリングコストを下げ

Figure 3. るた

Predicti め

ve

に近似入れた

performance and training

り

time ofour method compared to FITC and local FITC with kmeans andTraining time (hours) random clustering.

The standardized mean square error (SMSE) and negati

−ve

1 log predictive density (NLPD) averaged across all test points are reported;

0

0

smaller is better

kin40k .pumadyn

pole

kin40k

pumadyn

pole

kin40k

pumadyn

pole

•

(a) smse

(b) nlpd

(c) training time

S. Wil iamson, A. Dubey an

Chalupka d

et al.E. Xi

(2012) ng

that ,is "P

more ar

ef all

ficient el

thanMark

k-means ovT C

able h

1. ai

Test n Mon

performance te

of

the Carl

models o

on the Million Song

for Nonparametric

Figure Mi

3. x

andtur

tends

Predicti eto

ve Mo

give

de

more ls”

performance ,

balanced IC

and ML

cluster

2

sizes.

training 01

We 3

time .

denote

Dataset. MAE is the mean absolute error and SMSE and NLPD

our model with the random and RPC initialization as FG

ofP-our areasdefined

method previously.

compared All

to GP-based

FITC methods

and

are

localreported

FITC with

with kmeans and random clustering.

[並列化]

RANDOM and

P

FG

arall P

el -RP

MC method

arkov C respecti

hain

v

Mely

o .

The standardized mean square error (SMSE)

nte and

Carlne

o g

foati

r vstandard

e

No log deviation

nparametric o

predicti vver

e

Mi 5

x runs.

ture Our

density

Mo method

dels

(FGP

(NLPD) -RANDOM

av

and

eraged across all test points are reported;

We evaluate our model against six other competitive base-

FGP-RPC) significantly outperforms all other baselines.

smaller is better.

– Dirichlet process mix

in tu

lines.

the re

The

pre : ノン

first

パラ

baseline is メ

the トリ

local ック

FITC

ベ

model イズ混合モ

described

デル

vious section, with random or RPC assignments

METHOD

SMSE

MAE

NLPD

（クラスタ数の自動決定）

of

1200" the data points to clusters. Analogous to our model, we

4000"

FGP-RANDOM

0.715

refer

Chalupka et to them

al.

as

± 0.003 6.47

AVpar±

all 0.02 3.59

el"

± 0.01

FITC

(2012) -RANDOM

that is and

8"Pro

FI

cessoT

morer"C-RP

ef C. The

ficientsecond

than k-means

–

FGP-RPC

0.723

DPのDP mixtur

Synch"

and eはD

tends P

to m

± 0.003 6.48 ± 0.02 3.58 ± 0.01

baseline

gi i(vxetu

GPSVIr2e

0 に

00) な

is

る

FITC こ

4"Pro

と

with

cessor" を証明

stochastic v

more balanced cluster

ariational

Table 1. Test performance of the models on the Million Song

FITC-RANDOM

0.761 ± 0.009 6.74 ± 0.07 3.63 ± 0.03

sizes. We denote

inference (Hensman et al., 2013)

2"Protraining

cessor"

using

3000"

VB"

B = 2000

FITC-RPC

0.832 ± 0.027 7.11

Dataset. MAE ± 0.23 3.73

is the

± 0.07

mean absolute error and SMSE and NLPD

– 各mixtureは独立に

our model

800"

Gibbs"(1"Processor)"

ty*

MCMC

inducing

with

でき

points.

the

る

Note

that

random GP

GibSV

andb I has

RPCquadratic

s"(1"Processor)"

storage

GPSVI2000

0.724

initialization asty* FGP-

± 0.005 6.53 ± 0.04 3.64 ± 0.01

are as defined previously. All GP-based methods are reported with

lexi

complexity O(B2) which limits the total number of induc-

SOD2000

0.794 ± 0.011 6.94 ± 0.08 3.69 ± 0.01

lexi

–

2000"

詳細釣り合い条件を

RANDOM and FGP-RPC method respectively.

LR

0.770

6.846

NA

ing

Perp

崩さ

points ず

that に

can MC

be

MC

used.

を

Unlik 並列化可能

e our model and local

standard deviation over 5 runs. Our method (FGP-RANDOM and

Perp

CONSTANT

1.000

8.195

NA

FITC,

400"

the inducing locations cannot be learned and must

•

NN1

FGP 1.683

-RPC)

9.900

significantly NA

outperforms all other baselines.

条件を無視し

We evて強引に

be

aluate our 並列化す

selected on some

model ad

ag る

hoc 方法も

basis.

ainst

In

six

出て

addition

other い

to る

random

competitive base-

1000" NN50

1.332

8.208

NA

selection, we also clustered the dataset into partitions using

lines. The first baseline is the local FITC model described

RPC and k-means and used the centroids as the inducing

in the

METHOD

SMSE

MAE

NLPD

0"

0"

inputs.

previous

0"

We obtained

section,

500"

1000"essentially

with1500" identical

random

2000"

results

or2500" with

RPC k- even w

0. orse

assignments

[Figure residue: (a) test set perplexity against run time for AVparallel; (b) test set perplexity against run time for various algorithms. Legend: Local Gibbs Step, Global MH Step, Sync, VB; x-axis: Time (minutes).]

The third baseline (SOD2000) is the standard GP regression model where a subset of 2000 data points is randomly sampled for training and the rest is discarded. The second baseline (GPSVI2000) is FITC with stochastic variational inference (Hensman et al., 2013) trained using B = 2000 inducing points. Note that GPSVI has quadratic storage complexity O(B^2), which limits the total number of inducing points that can be used. Unlike our model and local FITC, the inducing locations cannot be learned and must be selected on some ad hoc basis. The remaining baselines include CONSTANT, which predicts the mean of the outputs; nearest neighbors with k = 1 (NN1) and k = 50 (NN50) neighbors; and linear regression (LR) – these were used in Bertin-Mahieux et al. (2011). In addition to random selection, we also clustered the dataset into partitions using RPC, which assigns data points to clusters; analogous to our model, we refer to them as FITC-RANDOM and FITC-RPC. For all of these GP-based methods, we repeat the experiments 5 times with different initialization of the parameters in the corresponding models; we thus report their performance with means and standard deviations over the 5 runs. Unless otherwise specified, eight processors are used.

Table 1 shows the results of all methods in terms of predictive accuracy (SMSE and MAE) and confidence (NLPD):

Method        SMSE            NLPD           MAE
FGP-RANDOM    0.715 ± 0.003   6.47 ± 0.02    3.59 ± 0.01
FGP-RPC       0.723 ± 0.003   6.48 ± 0.02    3.58 ± 0.01
FITC-RANDOM   0.761 ± 0.009   6.74 ± 0.07    3.63 ± 0.03
FITC-RPC      0.832 ± 0.027   7.11 ± 0.23    3.73 ± 0.07
GPSVI2000     0.724 ± 0.005   6.53 ± 0.04    3.64 ± 0.01
SOD2000       0.794 ± 0.011   6.94 ± 0.08    3.69 ± 0.01
LR            0.770           6.846          NA
CONSTANT      1.000           8.195          NA
NN1           1.683           9.900          NA
NN50          1.332           8.208          NA

Overall, our model is significantly better than all of the competing GP-based methods. In particular, it is more accurate (e.g. in terms of MAE) than all but GPSVI2000 by at least 0.27 year per song on average; this amounts to approximately 14,000 years in total, which is clearly a meaningful improvement. Furthermore, there is a noticeable difference in the log predictive density of FGP-RANDOM and FGP-RPC compared to the rest, which can be attributed to our model having localized experts. Linear regression does only slightly better than two of the GP-based methods, namely FITC-RPC and SOD2000, and the nearest-neighbor methods do worse than prediction using the constant mean. This encouraging result suggests the benefits of our model when dealing with very large datasets compared to a global function or model (like linear regression or FITC), which may not realistically capture the characteristics of the output space.
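The SOD baseline above is easy to state concretely: randomly sample a subset, fit an exact GP on it, and discard the rest, so the O(N^3) cost of exact GP regression becomes O(m^3) for subset size m. A minimal numpy sketch under assumed toy settings (the RBF kernel, lengthscale, and noise level here are illustrative, not the paper's):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sod_gp_predict(X, y, X_test, m=2000, noise=0.1, seed=0):
    """Subset-of-Data GP: fit an exact GP on m randomly chosen points only."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    K = rbf_kernel(Xs, Xs) + noise ** 2 * np.eye(len(idx))
    alpha = np.linalg.solve(K, ys)           # K^{-1} y, O(m^3) instead of O(N^3)
    return rbf_kernel(X_test, Xs) @ alpha    # predictive mean at the test inputs

# toy usage: 5000 noisy observations of sin(6x), exact GP fit on a subset of 500
rng = np.random.default_rng(1)
X = rng.random((5000, 1))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.normal(size=5000)
mu = sod_gp_predict(X, y, X[:10], m=500)
```

The table's SOD2000 row corresponds to m = 2000; the sketch makes it clear why SOD throws away information that FGP's localized experts retain.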

- Parallelization & streaming

• T. Broderick, N. Boyd, A. Wibisono, A. Wilson and M. Jordan, "Streaming Variational Bayes", NIPS 2013.

– Proposes SDA-Bayes, which learns in a parallel, streaming, asynchronous fashion

– Updates the parameters of the variational posterior via Bayes' rule

– The comparison target, Stochastic Variational Inference (SVI, i.e. Variational Bayes + stochastic gradient descent), suffers from the difficulty of choosing the minibatch size

– SDA-Bayes is robust to this choice

[Figure 3 residue: Sensitivity of SVI and SDA-Bayes to some respective parameters. (a) SVI sensitivity to D on Wikipedia; (b) sensitivity to minibatch size on Wikipedia; (c) SVI sensitivity to stepsize parameters on Wikipedia; (d)–(f) the same on Nature. Curves: SDA and SVI with minibatch sizes 16–4096, D varied over several orders of magnitude around the true corpus size, and step sizes (τ0, κ) ∈ {(16, 0.5), (64, 0.5), (64, 0.75), (64, 1), (256, 0.5)}; axes: log predictive probability vs. number of examples seen. Legends have the same top-to-bottom order as the rightmost curve points.]

Table 1: A comparison of (1) log predictive probability of held-out data and (2) running time of four algorithms: SDA-Bayes with 32 threads, SDA-Bayes with 1 thread, SVI, and SSU.

Wikipedia       32-SDA   1-SDA    SVI      SSU
Log pred prob   −7.30    −7.38    −7.39    −7.94
Time (hours)    2.07     25.37    6.56     10.11

Nature          32-SDA   1-SDA    SVI      SSU
Log pred prob   −7.07    −7.13    −7.14    −7.89
Time (hours)    0.31     7.00     1.73     1.99

4.2 Experiments

To facilitate comparison with SVI, we use the full Wikipedia corpus of [5] (rather than the subset Wikipedia corpus of [3]) and the Nature corpus of [3] for our experiments. These two corpuses represent a range of sizes (3,611,558 training documents for Wikipedia and 351,525 for Nature) as well as different types of topics. We expect words in Wikipedia to represent an extremely broad range of topics whereas we expect words in Nature to focus more on the sciences. We further use the vocabularies of [3, 5] and SVI code available online at [17]. We hold out 10,000 Wikipedia documents and 1,024 Nature documents (not included in the counts above) for testing. In all cases, we fit an LDA model with K = 100 topics and hyperparameters chosen as: ∀k, α_k = 1/K; ∀(k, v), η_kv = 1.

For both Wikipedia and Nature, we set the parameters in SVI according to the values of the parameters described in Table 1 of [3] (minibatch size 4,096, number of documents D correctly set in advance, step size parameters κ = 0.5 and τ0 = 64). We give single-thread SDA-Bayes and SSU the same minibatch size. Performance and timing results are shown in Table 1, where we can see that SVI and single-thread SDA-Bayes have comparable performance, while SSU performs much worse. SVI is faster than single-thread SDA-Bayes.

Full SDA-Bayes improves performance and run time. We handicap SDA-Bayes in the above comparisons by utilizing just a single thread. In Table 1, we also report performance of SDA-Bayes with 32 threads and a minibatch size of 256.
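The core SDA-Bayes recipe — treat the previous (approximate) posterior as the prior for the next minibatch, and express each batch's Bayes-rule update as an additive increment to the posterior parameters so increments can be combined in any order across threads — is exact in a conjugate model. The Beta-Bernoulli toy below is my illustration of that recipe, not the paper's LDA setup (where the per-batch posterior is itself a variational approximation):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Prior Beta(1, 1) over a coin's bias, stored as pseudo-counts [heads, tails].
PRIOR = np.array([1.0, 1.0])

def batch_posterior(prior, batch):
    """Absorb one minibatch via Bayes' rule: Beta prior + Bernoulli counts."""
    heads = int(np.sum(batch))
    return prior + np.array([heads, len(batch) - heads])

def sda_combine(prior, batch_posteriors):
    """SDA-Bayes-style combination: add the per-batch increments to the prior.
    Order- and thread-independent, so batches may arrive asynchronously."""
    return prior + sum(p - prior for p in batch_posteriors)

rng = np.random.default_rng(0)
data = (rng.random(10_000) < 0.7).astype(float)   # coin with bias 0.7
batches = np.array_split(data, 32)

# process minibatches in parallel; each worker applies Bayes' rule to the prior
with ThreadPoolExecutor(max_workers=8) as ex:
    posts = list(ex.map(lambda b: batch_posterior(PRIOR, b), batches))

post = sda_combine(PRIOR, posts)
mean = post[0] / post.sum()   # posterior mean of the bias, close to 0.7
```

Because the combination is a plain sum of increments, the 32-thread result here is identical to processing the stream sequentially — which is the property that lets 32-SDA in Table 1 match 1-SDA's accuracy while cutting the run time.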

11 - Summary

• Probabilistic modeling may be well suited to large-scale data

– No labeled training data required

– Flexible modeling lets you pursue a wide variety of goals

• However, inference is hard

– It is computationally heavy

– Some of it provably cannot be parallelized

– Perhaps that is why it has not spread much in distributed-processing frameworks?

• This talk grouped the approaches into three categories:

– Sampling the data

– Transforming the model

– Parallelization & streaming

• Not covered this time:

– Model-specific inference and fast mixing: type-based MCMC, …

– Issei Sato's slides: "Large-Scale Bayesian Learning in the Big Data Era – Focusing on Stochastic Gradient Langevin Dynamics"

http://www.slideshare.net/issei_sato/big-datastochastic-gradient-langevin-dynamics
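For a flavor of what those slides cover: an SGLD step is a minibatch stochastic-gradient step on the log posterior plus Gaussian noise whose variance matches the step size, so the iterates become (approximate) posterior samples rather than a point estimate. A toy sketch on a hypothetical Gaussian model (all names and settings are illustrative):

```python
import numpy as np

def sgld_step(theta, minibatch, grad_log_prior, grad_log_lik, N, eps, rng):
    """One Stochastic Gradient Langevin Dynamics step: half-step of gradient
    ascent on a minibatch estimate of the log posterior, plus N(0, eps) noise."""
    g = grad_log_prior(theta) + (N / len(minibatch)) * sum(
        grad_log_lik(theta, x) for x in minibatch)
    return theta + 0.5 * eps * g + rng.normal(0.0, np.sqrt(eps))

# toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 10)
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=1000)
theta, eps, samples = 0.0, 1e-3, []
for t in range(2000):
    mb = rng.choice(data, size=50)          # minibatch of the data
    theta = sgld_step(theta, mb,
                      grad_log_prior=lambda th: -th / 10.0,
                      grad_log_lik=lambda th, x: x - th,
                      N=len(data), eps=eps, rng=rng)
    samples.append(theta)
# after burn-in, samples hover around the posterior mean (about 2.0 here)
```

In practice the step size eps is decayed over iterations; a fixed small eps is used above only to keep the sketch short.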

10