This page reproduces the content of http://www.slideshare.net/jmiseikis/learning-the-structure-of-deep-sparse-graphical-models-paper-presentation

Uploaded 2012/11/02, in Technology

Full lecture presentation of the paper "Learning the Structure of Deep Sparse Graphical Models" by Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani - http://arxiv.org/pdf/1001.0160.pdf

Presented at ETH Zürich.

Learning the Structure of Deep Sparse Graphical Models

Ryan P. Adams, Hanna M. Wallach and Zoubin Ghahramani

Presented by Justinas Mišeikis

Supervisor: Alexander Vezhnevets

Deep belief networks

Deep belief networks consist of multiple layers
They consist of visible and hidden nodes
Visible nodes are only on the outside layer and represent the output
Nodes are linked using directed edges

[Figure: graphical model of a deep belief network, with hidden layers and a visible layer]

Properties:
  # of layers
  # of nodes in each layer
  Network connectivity: connections are allowed only between consecutive layers
  Node types: binary or continuous?

The problem - DBN structure

What is the best structure for a DBN?

Number of hidden units in each layer

Number of hidden layers

Types of unit behaviour

Connectivity

This paper presents a non-parametric Bayesian approach for learning the structure of a layered DBN.

Finite single-layer network

Network connectivity is represented using binary matrices:
  Columns and rows represent nodes
  Zero (unfilled) - no connection
  One (filled) - a connection

[Figure: binary connectivity matrix between the visible and hidden layers]
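As a minimal illustration of this representation (the variable names and the row/column orientation are assumptions for the sketch, not taken from the slides), the binary matrix can be read off into an edge list:

```python
# Toy binary connectivity matrix: each row is a hidden unit, each column a
# visible unit; Z[k][i] = 1 means hidden unit k connects to visible unit i.
# (The orientation is an illustrative assumption.)
Z = [
    [1, 0, 1],
    [0, 1, 0],
]

def edges(Z):
    """Return the (hidden, visible) pairs encoded by the binary matrix."""
    return [(k, i) for k, row in enumerate(Z) for i, z in enumerate(row) if z]

print(edges(Z))  # [(0, 0), (0, 2), (1, 1)]
```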

The network dimensions for the prior have to be defined in advance:
  How many hidden units should there be? Not sure.
  Can we have an infinite number of hidden units?
  Solution: the Indian buffet process

The Indian buffet process

The Indian buffet process (IBP) is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. *

Rows - customers (visible layer), a finite number of units
Columns - dishes (hidden layer), an unbounded number of countable units

The IBP creates sparse matrices whose posterior has a finite number of non-zero columns; during the learning process, however, the matrix can grow column-wise without limit.

* Thomas L. Griffiths, Zoubin Ghahramani. The Indian Buffet Process: An Introduction and Review. 2011.
http://jmlr.csail.mit.edu/papers/volume12/griffiths11a/griffiths11a.pdf

[Figure: IBP binary matrix. Rows: customers; columns: dishes.]

1st customer tries 2 new dishes
2nd customer tries 1 old dish + 2 new
3rd customer tries 2 old dishes + 1 new
4th customer tries 2 old dishes + 2 new
5th customer tries 4 old dishes + 2 new
...

Parameters: α and β
ηk - the number of previous customers that have tried dish k

The jth customer tries:
  each previously tasted dish k with probability ηk / (j + β - 1)
  a Poisson(αβ / (j + β - 1)) number of new dishes
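The generative process above can be sketched in Python. This is a simplified sampler under the stated probabilities; the function and argument names are illustrative, not from the paper:

```python
import math
import random

def poisson_sample(lam, rng):
    """Sample from Poisson(lam) via Knuth's multiply-uniforms method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_ibp(num_customers, alpha=3.0, beta=1.0, seed=0):
    """Sample a binary matrix from the two-parameter Indian buffet process.

    Rows are customers, columns are dishes; the number of columns grows
    whenever a customer samples new dishes.
    """
    rng = random.Random(seed)
    dish_counts = []          # eta_k: how many customers have tried dish k
    rows = []
    for j in range(1, num_customers + 1):
        row = []
        # previously tasted dish k with probability eta_k / (j + beta - 1)
        for k, eta_k in enumerate(dish_counts):
            tried = rng.random() < eta_k / (j + beta - 1)
            row.append(1 if tried else 0)
            if tried:
                dish_counts[k] += 1
        # a Poisson(alpha * beta / (j + beta - 1)) number of new dishes
        new_dishes = poisson_sample(alpha * beta / (j + beta - 1), rng)
        row.extend([1] * new_dishes)
        dish_counts.extend([1] * new_dishes)
        rows.append(row)
    # pad earlier rows with zeros for dishes introduced later
    width = len(dish_counts)
    return [r + [0] * (width - len(r)) for r in rows]
```

Running `sample_ibp(5)` yields a 5-row binary matrix whose column count depends on how many new dishes were sampled along the way.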

If no more customers come in, the marked binary matrix defines the structure of the deep belief network.

Multi-layer network

Single-layer: hidden units are independent
Multi-layer: hidden units can be dependent
Solution: extend the IBP to an unlimited number of layers -> a deep belief network with unbounded width and depth

"While a belief network with an infinitely-wide hidden layer can represent any probability distribution arbitrarily closely, it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This 'shallowness' is a strong assumption that weakens the model in practice, and the explosion of recent literature on deep belief networks speaks to the empirical success of belief networks with more hidden structure."

Cascading IBP

The cascading Indian buffet process (CIBP) builds a prior on belief networks that are unbounded in both width and depth.

The prior has the following properties:
  Each of the "dishes" in the restaurant of layer m is also a "customer" in the restaurant of layer m+1
  Columns of the layer-m binary matrix correspond to rows of the layer-(m+1) binary matrix
  The matrices in the CIBP are constructed in sequence, starting with m = 0, the visible layer
  The number of non-zero columns in matrix m+1 is determined entirely by the active non-zero columns of the previous matrix m

Layer 1 has 5 customers who taste 5 dishes in total
Layer 2 'inherits' 5 customers <- 5 dishes in the previous layer
These 5 customers in layer 2 taste 7 dishes in total
Layer 3 'inherits' 7 customers <- 7 dishes in the previous layer
This continues until, in some layer, the customers taste zero dishes
...

[Figure: binary matrices for layers 1-3 of the cascade]
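The cascade of restaurants can be sketched as a loop that feeds each layer's dish count back in as the next layer's customer count. This is a rough sketch under the slide's description; the function and parameter names are illustrative:

```python
import math
import random

def poisson_sample(lam, rng):
    """Sample from Poisson(lam) via Knuth's multiply-uniforms method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_cibp_widths(num_visible, alpha=1.0, beta=1.0, seed=0, max_layers=50):
    """Sample layer widths from the cascading IBP.

    Each layer's dishes become the next layer's customers; the cascade
    stops at the absorbing state where a layer's customers taste no dishes.
    """
    rng = random.Random(seed)
    widths = [num_visible]
    customers = num_visible
    for _ in range(max_layers):
        dish_counts = []  # eta_k for each dish in this layer's restaurant
        for j in range(1, customers + 1):
            for k in range(len(dish_counts)):
                if rng.random() < dish_counts[k] / (j + beta - 1):
                    dish_counts[k] += 1
            dish_counts += [1] * poisson_sample(alpha * beta / (j + beta - 1), rng)
        if not dish_counts:
            break  # absorbing state reached: no dishes tasted
        widths.append(len(dish_counts))
        customers = len(dish_counts)
    return widths
```

For example, `sample_cibp_widths(5)` might return `[5, 5, 7, ...]`, mirroring the layer sizes in the example above.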

CIBP parameters

Two main parameters: α and β
  α - defines the expected in-degree of each unit, i.e. its number of parents
  β - controls the expected out-degree, i.e. the number of children, via the following equation:

[Equation: expected out-degree as a function of K(m), α and β]

K(m) is the number of columns in layer m.

α and β are layer-specific rather than constant across the whole network; they can be written as α(m) and β(m).

CIBP convergence

Does the CIBP eventually converge to create a finite-depth DBN? Yes!

How? Applying this transition distribution to the Markov chain:

[Equation: the transition distribution, a Poisson with mean λ(K(m); α, β)]

It is simply a Poisson distribution with mean λ(K(m); α, β). The absorbing state, where no 'dishes' are tasted, will always be reached. The full mathematical proof of convergence is given in the appendix of the paper.

[Figure: CIBP convergence example with α = 3, β = 1]

CIBP-based prior samples

[Figure: network structures sampled from the CIBP prior]

Node types

The nonlinear Gaussian belief network (NLGBN) framework is used.

A unit's pre-sigmoid value u is the activation sum y plus Gaussian noise with precision ν; the noisy sum is then transformed with the sigmoid function σ(∙).

[Figure: resulting unit distributions for binary, Gaussian and deterministic node types. Black line: zero pre-sigmoid mean; blue line: pre-sigmoid mean of -1; red line: pre-sigmoid mean of +1.]
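A single NLGBN-style unit can be sketched in a few lines. This is a minimal illustration of the noisy-sum-then-sigmoid behaviour; the function and argument names are assumptions, not from the paper:

```python
import math
import random

def nlgbn_unit(activation_sum, precision, rng):
    """One NLGBN-style unit: add zero-mean Gaussian noise with the given
    precision (inverse variance) to the summed input, then pass the noisy
    sum through a sigmoid."""
    noise = rng.gauss(0.0, 1.0 / math.sqrt(precision))
    u = activation_sum + noise            # noisy pre-sigmoid value
    return 1.0 / (1.0 + math.exp(-u))     # sigmoid transform

rng = random.Random(0)
# With a pre-sigmoid mean of -1 the outputs concentrate below 0.5.
samples = [nlgbn_unit(-1.0, precision=4.0, rng=rng) for _ in range(1000)]
```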

Inference: joint distribution

[Equation: the joint distribution over the network; its terms involve the precision of the Gaussian noise, the input data, the activations, the bias weights, the in-layer units, the NLGBN weights matrix, the layer number and the number of observations.]

Markov chain Monte Carlo

* Christophe Andrieu, Nando de Freitas, Arnaud Doucet, Michael I. Jordan. An Introduction to MCMC for Machine Learning. 2003.
http://www.cs.princeton.edu/courses/archive/spr06/cos598C/papers/AndrieuFreitasDoucetJordan2003.pdf

Inference

Task: find the posterior distribution over the structure and the parameters of the network.

Conditioning is used to update the model part by part rather than modifying the whole model at each step.

The process is split into four parts:
  Edges: sample from the posterior distribution over each edge's weight
  Activations: sample from the posterior distributions over the Gaussian noise precision
  Structure: sample the ancestors of the visible units
  Parameters: closely tied with the hyper-parameters

Sampling from the structure

[Figure: a two-layer network (Layer 1, Layer 2) used to illustrate structure sampling]

First phase:
  For each layer:
    For each unit k in the layer:
      Check each connected unit in layer m+1, indexed by k'
      Count the non-zero entries in the k'th column of the binary matrix, excluding the entry in the kth row
      If the sum is zero, the unit k' is a singleton parent
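The singleton check above can be sketched directly on a binary matrix. The matrix orientation and names here are illustrative assumptions:

```python
def singleton_parents(Z, k):
    """Given a binary matrix Z (rows: child units, columns: parent units in
    the layer above) and a child row index k, return the parents k' whose
    column has no other non-zero entry, i.e. whose only child is unit k."""
    singles = []
    for kp in range(len(Z[0])):
        if Z[k][kp] == 1:
            # sum of the k'th column excluding the entry in row k
            others = sum(Z[j][kp] for j in range(len(Z)) if j != k)
            if others == 0:
                singles.append(kp)
    return singles

Z = [
    [1, 1, 0],
    [0, 1, 1],
]
print(singleton_parents(Z, 0))  # [0]: parent 0 connects only to unit 0
```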

[Figure: the same two-layer network, highlighting a singleton parent]

Second phase:
  Considers only the singletons
  Option a: add a new parent
  Option b: delete the connection to child k
  Decisions are made by a Metropolis-Hastings operator using a birth/death process
  In the end, units that are not ancestors of the visible units are discarded.
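The birth/death decision relies on a standard Metropolis-Hastings accept/reject test. A generic sketch follows; the real acceptance ratio in the paper includes model-specific likelihood and prior terms, and the names here are illustrative:

```python
import math
import random

def mh_accept(log_p_current, log_p_proposed, log_q_ratio, rng):
    """Generic Metropolis-Hastings test: accept the proposal with
    probability min(1, p(proposed)/p(current) * proposal-ratio),
    computed in log space for numerical stability."""
    log_alpha = log_p_proposed - log_p_current + log_q_ratio
    return math.log(rng.random()) < min(0.0, log_alpha)

rng = random.Random(0)
# A much more probable proposal (e.g. a birth move that greatly improves
# the posterior) is accepted; a vastly worse one is rejected.
accept_better = mh_accept(-10.0, 0.0, 0.0, rng)
```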

Experiments

Three image datasets were used for the experiments:
  Olivetti faces
  MNIST digit data
  Frey faces

Performance test - image reconstruction

The bottom halves of the images were removed, and the model had to reconstruct the missing data by 'seeing' only the top half. A top-bottom split was chosen instead of left-right because both faces and digits have left-right symmetry, which would make the task easier.

Olivetti faces

350 + 50 images of 40 distinct subjects, 64x64
~3 hidden layers: around 70 units in each layer

[Figure: raw predictive fantasies from the model]

MNIST digit data

50 + 10 images of 10 digits, 28x28
~3 hidden layers: 120, 100, 70 units in the hidden layers

Frey faces

1865 + 100 images of a single face with different expressions, 20x28
~3 hidden layers: 260, 120, 35 units in the hidden layers

Discussion

Addresses the structural issues with deep belief networks
Unites two areas of research: nonparametric Bayesian methods and deep belief networks
Introduces the cascading Indian buffet process to allow an unbounded number of layers
The CIBP always converges
Result: the algorithm learns the effective model complexity

Discussion

A very processor-intensive algorithm: finding the reconstructions took 'a few hours of CPU time'
Much better than fixed-dimensionality DBNs?

Thank you!