This page reproduces the content of http://www.slideshare.net/yutakashino/automatic-variational-inference-in-stan-nips2015yomi20160120.

Uploaded 2016/01/20, in Technology.

Automatic Variational Inference in Stan


- Yuta Kashino

• BakFoo, Inc. CEO

• Zope / Python

• Astro Physics / Observational Cosmology

• Realtime Data Platform for Enterprise

3 - ADVI in Stan

Automatic Variational Inference in Stan

Alp Kucukelbir (Columbia University, alp@cs.columbia.edu)

Rajesh Ranganath (Princeton University, rajeshr@cs.princeton.edu)

Andrew Gelman (Columbia University, gelman@stat.columbia.edu)

David M. Blei (Columbia University, david.blei@columbia.edu)

Abstract

Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult for non-experts to use. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI); we implement it in Stan (code available), a probabilistic programming system. In ADVI the user provides a Bayesian model and a dataset, nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan.

1 Introduction

Bayesian inference is a powerful framework for analyzing data. We design a model for data using latent variables; we then analyze data by calculating the posterior density of the latent variables. For machine learning models, calculating the posterior is often difficult; we resort to approximation.

Variational inference (VI) approximates the posterior with a simpler distribution [1, 2]. We search over a family of simple distributions and find the member closest to the posterior. This turns approximate inference into optimization. VI has had a tremendous impact on machine learning; it is typically faster than Markov chain Monte Carlo (MCMC) sampling (as we show here too) and has recently scaled up to massive data [3].

Unfortunately, VI algorithms are difficult to derive. We must first define the family of approximating distributions, and then calculate model-specific quantities relative to that family to solve the variational optimization problem. Both steps require expert knowledge. The resulting algorithm is tied to both the model and the chosen approximation.

In this paper we develop a method for automating variational inference, automatic differentiation variational inference (ADVI). Given any model from a wide class (specifically, probability models differentiable with respect to their latent variables), ADVI determines an appropriate variational family and an algorithm for optimizing the corresponding variational objective. We implement ADVI in Stan [4], a flexible probabilistic programming system. Stan describes a high-level language to define probabilistic models (e.g., Figure 2) as well as a model compiler, a library of transformations, and an efficient automatic differentiation toolbox. With ADVI we can now use variational inference on any model we write in Stan.¹ (See Appendices F to J.)

¹ADVI is available in Stan 2.8. See Appendix C.

1 - Objective

• Automating Variational Inference (VI)

• ADVI: Automatic Differentiation VI

• give some probability model with latent variables

• get some data

• infer the latent variables

• Implementation in Stan

5 - Objective

Model: θ ~ Weibull(α = 1.5, σ = 1); xₙ ~ Poisson(θ), n = 1, ..., N

data {
  int N;     // number of observations
  int x[N];  // discrete-valued observations
}
parameters {
  // latent variable, must be positive
  real<lower=0> theta;
}
model {
  // non-conjugate prior for latent variable
  theta ~ weibull(1.5, 1);
  // likelihood
  for (n in 1:N)
    x[n] ~ poisson(theta);
}

Figure 2: Specifying a simple nonconjugate probability model in Stan.
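As a rough illustration of the generative process this Stan program encodes, here is a minimal Python sketch; the `sample_weibull` and `sample_poisson` helpers are hand-rolled stand-ins for Stan's own RNGs, written out only so the block is self-contained:

```python
import math, random

random.seed(1)

def sample_weibull(alpha, sigma):
    # inverse-CDF sampling: F^{-1}(u) = sigma * (-log(1 - u))**(1 / alpha)
    u = random.random()
    return sigma * (-math.log(1.0 - u)) ** (1.0 / alpha)

def sample_poisson(rate):
    # Knuth's multiplication method; adequate for the small rates seen here
    limit, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

# theta ~ Weibull(1.5, 1); x_n ~ Poisson(theta)
theta = sample_weibull(1.5, 1.0)
x = [sample_poisson(theta) for _ in range(100)]
print(theta > 0, all(xi >= 0 for xi in x))
```

The observed x[n] are discrete counts while the latent rate theta is continuous and positive, which is exactly the structure the paper's differentiability requirement targets.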

analysis posits a prior density p(θ) on the latent variables. Combining the likelihood with the prior gives the joint density p(X, θ) = p(X | θ) p(θ).

We focus on approximate inference for differentiable probability models. These models have continuous latent variables θ. They also have a gradient of the log-joint with respect to the latent variables, ∇_θ log p(X, θ). The gradient is valid within the support of the prior, supp(p(θ)) = {θ | θ ∈ R^K and p(θ) > 0} ⊆ R^K, where K is the dimension of the latent variable space. This support set is important: it determines the support of the posterior density and plays a key role later in the paper. We make no assumptions about conjugacy, either full or conditional.²

For example, consider a model that contains a Poisson likelihood with unknown rate, p(x | θ). The observed variable x is discrete; the latent rate θ is continuous and positive. Place a Weibull prior on θ, defined over the positive real numbers. The resulting joint density describes a nonconjugate differentiable probability model. (See Figure 2.) Its partial derivative ∂/∂θ p(x, θ) is valid within the support of the Weibull distribution, supp(p(θ)) = R⁺ ⊂ R. Because this model is nonconjugate, the posterior is not a Weibull distribution. This presents a challenge for classical variational inference. In Section 2.3, we will see how ADVI handles this model.

Many machine learning models are differentiable. For example: linear and logistic regression, matrix factorization with continuous or discrete measurements, linear dynamical systems, and Gaussian processes. Mixture models, hidden Markov models, and topic models have discrete random variables. Marginalizing out these discrete variables renders these models differentiable. (We show an example in Section 3.3.) However, marginalization is not tractable for all models, such as the Ising model, sigmoid belief networks, and (untruncated) Bayesian nonparametric models.

2.2 Variational Inference

Bayesian inference requires the posterior density p(θ | X), which describes how the latent variables vary when conditioned on a set of observations X. Many posterior densities are intractable because their normalization constants lack closed forms. Thus, we seek to approximate the posterior.

Consider an approximating density q(θ; φ) parameterized by φ. We make no assumptions about its shape or support. We want to find the parameters of q(θ; φ) to best match the posterior according to some loss function. Variational inference (VI) minimizes the Kullback-Leibler (KL) divergence from the approximation to the posterior [2],

φ* = arg min_φ KL(q(θ; φ) ∥ p(θ | X)).    (1)

Typically the KL divergence also lacks a closed form. Instead we maximize the evidence lower bound (ELBO), a proxy to the KL divergence,

L(φ) = E_{q(θ)}[log p(X, θ)] − E_{q(θ)}[log q(θ; φ)].

The first term is an expectation of the joint density under the approximation, and the second is the entropy of the variational density. Maximizing the ELBO minimizes the KL divergence [1, 16].

²The posterior of a fully conjugate model is in the same family as the prior; a conditionally conjugate model has this property within the complete conditionals of the model [3].
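To make the ELBO concrete, here is a hedged Python sketch that Monte Carlo estimates L(φ) for the running Poisson–Weibull model of Figure 2, using a log-normal q; the data and the parameter values (0.05, 0.33, 2.0) are illustrative choices, not from the paper:

```python
import math, random

random.seed(0)
x = [0, 1, 2, 1, 0, 3]  # illustrative data for the Figure 2 model

def log_joint(theta, alpha=1.5, sigma=1.0):
    # log p(X, theta) = sum_n log Poisson(x_n | theta) + log Weibull(theta; alpha, sigma)
    log_lik = sum(n * math.log(theta) - theta - math.lgamma(n + 1) for n in x)
    log_prior = (math.log(alpha / sigma) + (alpha - 1) * math.log(theta / sigma)
                 - (theta / sigma) ** alpha)
    return log_lik + log_prior

def elbo(mu, s, M=20000):
    # MC estimate of E_q[log p(X, theta)] - E_q[log q(theta)]
    # with q a log-normal: zeta = log(theta) ~ N(mu, s^2)
    total = 0.0
    for _ in range(M):
        zeta = random.gauss(mu, s)
        theta = math.exp(zeta)
        log_q = (-0.5 * math.log(2 * math.pi) - math.log(s)
                 - zeta - 0.5 * ((zeta - mu) / s) ** 2)
        total += log_joint(theta) - log_q
    return total / M

good, bad = elbo(0.05, 0.33), elbo(2.0, 0.33)
print(good > bad)  # a better-matched q yields a higher bound on log p(X)
```

Maximizing this quantity over (mu, s) is exactly the optimization that VI turns approximate inference into.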

3 - Advantage

• very fast

• able to handle big data

• no hassle

• already available in Stan

[Figure 1: Held-out predictive accuracy (average log predictive vs. seconds, log scale) | Gaussian mixture model (GMM) of the imageCLEF image histogram dataset. (a) Subset of 1000 images: ADVI outperforms the no-U-turn sampler (NUTS), the default sampling method in Stan [5]. (b) Full dataset of 250 000 images: ADVI scales to large datasets by subsampling minibatches of size B (B = 50, 100, 500, 1000) from the dataset at each iteration [3]. We present more details in Section 3.3 and Appendix J.]

Figure 1 illustrates the advantages of our method. Consider a nonconjugate Gaussian mixture model for analyzing natural images; this is 40 lines in Stan (Figure 10). Figure 1a illustrates Bayesian inference on 1000 images. The y-axis is held-out likelihood, a measure of model fitness; the x-axis is time on a log scale. ADVI is orders of magnitude faster than NUTS, a state-of-the-art MCMC algorithm (and Stan's default inference technique) [5]. We also study nonconjugate factorization models and hierarchical generalized linear models in Section 3.

Figure 1b illustrates Bayesian inference on 250 000 images, the size of data we more commonly find in machine learning. Here we use ADVI with stochastic variational inference [3], giving an approximate posterior in under two hours. For data like these, MCMC techniques cannot complete the analysis.
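The minibatch idea behind Figure 1b can be sketched numerically: rescaling a size-B minibatch log-likelihood by N/B gives an unbiased estimate of the full-data log-likelihood. The toy Gaussian data and per-datum term below are made up purely to show the scaling; this is not the paper's GMM:

```python
import random

random.seed(0)
N, B = 1000, 50
data = [random.gauss(0.0, 1.0) for _ in range(N)]  # toy data, not the paper's images

def loglik(subset):
    # made-up per-datum log-likelihood term, standing in for the model's
    return sum(-0.5 * v * v for v in subset)

# Rescale a size-B minibatch by N/B: an unbiased estimate of loglik(data)
estimates = [(N / B) * loglik(random.sample(data, B)) for _ in range(4000)]
avg = sum(estimates) / len(estimates)
print(abs(avg - loglik(data)) / abs(loglik(data)) < 0.05)
```

Because the estimate is unbiased, stochastic optimization of the ELBO can use one cheap minibatch per iteration instead of a full pass over the data.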

Related work. ADVI automates variational inference within the Stan probabilistic programming system [4]. This draws on two major themes.

The first is a body of work that aims to generalize VI. Kingma and Welling [6] and Rezende et al. [7] describe a reparameterization of the variational problem that simplifies optimization. Ranganath et al. [8] and Salimans and Knowles [9] propose a black-box technique, one that only requires the model and the gradient of the approximating family. Titsias and Lázaro-Gredilla [10] leverage the gradient of the joint density for a small class of models. Here we build on and extend these ideas to automate variational inference; we highlight technical connections as we develop the method.

The second theme is probabilistic programming. Wingate and Weber [11] study VI in general probabilistic programs, as supported by languages like Church [12], Venture [13], and Anglican [14]. Another probabilistic programming system is Infer.NET, which implements variational message passing [15], an efficient algorithm for conditionally conjugate graphical models. Stan supports a more comprehensive class of nonconjugate models with differentiable latent variables; see Section 2.1.

2 Automatic Differentiation Variational Inference

Automatic differentiation variational inference (ADVI) follows a straightforward recipe. First we transform the support of the latent variables to the real coordinate space. For example, the logarithm transforms a positive variable, such as a standard deviation, to the real line. Then we posit a Gaussian variational distribution to approximate the posterior. This induces a non-Gaussian approximation in the original variable space. Last we combine automatic differentiation with stochastic optimization to maximize the variational objective. We begin by defining the class of models we support.

2.1 Differentiable Probability Models

Consider a dataset X = x_{1:N} with N observations. Each xₙ is a discrete or continuous random vector. The likelihood p(X | θ) relates the observations to a set of latent random variables θ. Bayesian

2 - Introduction

• VI: difficult to derive

• define the family of approx. distrib.

• solve the variational optimisation prob.

• calculate model-specific quantities

• need expert knowledge

8 - Related Work

• Generalising VI

• reparameterization of VI: Kingma&Welling, Rezende+

• black-box VI: Ranganath+, Salimans&Knowles

• gradient of the joint density: Titsias&Lázaro-Gredilla

• Probabilistic Programming

9 - Notations

data set: X = x_{1:N}

latent variables: θ

likelihood: p(X | θ)

prior density: p(θ)

joint density: p(X, θ) = p(X | θ) p(θ)

log joint gradient: ∇_θ log p(X, θ)

support of the prior: supp(p(θ)) = {θ | θ ∈ R^K and p(θ) > 0} ⊆ R^K
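These quantities can be written down directly for the running Poisson–Weibull model. A small Python sketch with made-up data, which also checks the analytic log-joint gradient against a finite difference:

```python
import math

x = [0, 1, 2, 1, 0, 3]  # made-up observations for the running model

def log_joint(theta, alpha=1.5, sigma=1.0):
    # log p(X, theta); defined only on the support of the prior, theta > 0
    if theta <= 0:
        return float("-inf")
    log_lik = sum(n * math.log(theta) - theta - math.lgamma(n + 1) for n in x)
    log_prior = (math.log(alpha / sigma) + (alpha - 1) * math.log(theta / sigma)
                 - (theta / sigma) ** alpha)
    return log_lik + log_prior

def grad_log_joint(theta, alpha=1.5, sigma=1.0):
    # nabla_theta log p(X, theta), valid only within supp(p(theta)) = R+
    return (sum(n / theta - 1.0 for n in x)
            + (alpha - 1.0) / theta
            - (alpha / sigma) * (theta / sigma) ** (alpha - 1.0))

theta, eps = 0.8, 1e-6
fd = (log_joint(theta + eps) - log_joint(theta - eps)) / (2 * eps)
print(abs(fd - grad_log_joint(theta)) < 1e-4)
```

Note the guard on theta <= 0: the gradient is only meaningful inside the support of the prior, which is the point the notation above emphasizes.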

10 - Non-Conjugate

data {
  int N;     // number of observations
  int x[N];  // discrete-valued observations
}
parameters {
  // latent variable, must be positive
  real<lower=0> theta;
}
model {
  // non-conjugate prior for latent variable
  theta ~ weibull(1.5, 1);
  // likelihood
  for (n in 1:N)
    x[n] ~ poisson(theta);
}

Figure 2: Specifying a simple nonconjugate probability model in Stan.

• the partial derivative ∂/∂θ p(x, θ) is valid within the support of the Weibull distribution:

supp(p(θ)) = R⁺ ⊂ R


3 - data

Model: θ ~ Weibull(α = 1.5, σ = 1); xₙ ~ Poisson(θ)

data {
  int N;     // number of observations
  int x[N];  // discrete-valued observations
}
parameters {
  // latent variable, must be positive
  real<lower=0> theta;
}
model {
  // non-conjugate prior for latent variable
  theta ~ weibull(1.5, 1);
  // likelihood
  for (n in 1:N)
    x[n] ~ poisson(theta);
}

Figure 2: Specifying a simple nonconjugate probability model in Stan.

- Variational Inference

• KL divergence lacks a closed form

• maximize the evidence lower bound (ELBO)

• VI is difficult to automate

• non-conjugate

• black-box, fixed approx.

The minimization problem from Eq. (1) becomes

φ* = arg max_φ L(φ)   such that   supp(q(θ; φ)) ⊆ supp(p(θ | X)).    (2)

We explicitly specify the support-matching constraint implied in the KL divergence.³ We highlight this constraint, as we do not specify the form of the variational approximation; thus we must ensure that q(θ; φ) stays within the support of the posterior, which is defined by the support of the prior.

Why is VI difficult to automate? In classical variational inference, we typically design a conditionally conjugate model. Then the optimal approximating family matches the prior. This satisfies the support constraint by definition [16]. When we want to approximate models that are not conditionally conjugate, we carefully study the model and design custom approximations. These depend on the model and on the choice of the approximating density.

One way to automate VI is to use black-box variational inference [8, 9]. If we select a density whose support matches the posterior, then we can directly maximize the ELBO using Monte Carlo (MC) integration and stochastic optimization. Another strategy is to restrict the class of models and use a fixed variational approximation [10]. For instance, we may use a Gaussian density for inference in unrestrained differentiable probability models, i.e., where supp(p(θ)) = R^K.

We adopt a transformation-based approach. First we automatically transform the support of the latent variables in our model to the real coordinate space. Then we posit a Gaussian variational density. The transformation induces a non-Gaussian approximation in the original variable space and guarantees that it stays within the support of the posterior. Here is how it works.

2.3 Automatic Transformation of Constrained Variables

Begin by transforming the support of the latent variables θ such that they live in the real coordinate space R^K. Define a one-to-one differentiable function T : supp(p(θ)) → R^K and identify the transformed variables as ζ = T(θ). The transformed joint density g(X, ζ) is

g(X, ζ) = p(X, T⁻¹(ζ)) |det J_{T⁻¹}(ζ)|,

where p is the joint density in the original latent variable space, and J_{T⁻¹} is the Jacobian of the inverse of T. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes [17]. (See Appendix D.)

Consider again our running example. The rate θ lives in R⁺. The logarithm ζ = T(θ) = log(θ) transforms R⁺ to the real line R. Its Jacobian adjustment is the derivative of the inverse of the logarithm, |det J_{T⁻¹}(ζ)| = exp(ζ). The transformed density is

g(x, ζ) = Poisson(x | exp(ζ)) Weibull(exp(ζ); 1.5, 1) exp(ζ).

Figures 3a and 3b depict this transformation.
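One can check numerically that the Jacobian factor exp(ζ) preserves probability mass under this change of variables. A small sketch for a single observation x = 2; the integration grids and the x value are illustrative:

```python
import math

ALPHA, SIGMA, X_OBS = 1.5, 1.0, 2  # illustrative values

def p(theta):
    # joint p(x = 2, theta) = Poisson(2 | theta) * Weibull(theta; 1.5, 1), theta > 0
    if theta <= 0:
        return 0.0
    pois = theta ** X_OBS * math.exp(-theta) / math.factorial(X_OBS)
    weib = ((ALPHA / SIGMA) * (theta / SIGMA) ** (ALPHA - 1)
            * math.exp(-(theta / SIGMA) ** ALPHA))
    return pois * weib

def g(zeta):
    # transformed density g(x, zeta) = p(x, exp(zeta)) * exp(zeta)  (T = log)
    return p(math.exp(zeta)) * math.exp(zeta)

# midpoint-rule integrals over theta in (0, 20) and zeta in (-10, 10)
h = 0.001
mass_theta = sum(p((i + 0.5) * h) * h for i in range(20000))
mass_zeta = sum(g(-10 + (i + 0.5) * h) * h for i in range(20000))
print(abs(mass_theta - mass_zeta) < 1e-4)
```

Without the exp(ζ) factor the two integrals would disagree, which is exactly why the transformation library must carry the Jacobian alongside each transform.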

As we describe in the introduction, we implement our algorithm in Stan to enable generic inference.

Stan implements a model compiler that automatically handles transformations. It works by applying

a library of transformations and their corresponding Jacobians to the joint model density.4 This

transforms the joint density of any differentiable probability model to the real coordinate space. Now

we can choose a variational distribution independent from the model.

2.4 Implicit Non-Gaussian Variational Approximation

After the transformation, the latent variables ζ have support on R^K. We posit a diagonal (mean-field) Gaussian variational approximation

q(ζ; φ) = N(ζ; μ, σ²) = ∏_{k=1}^{K} N(ζ_k; μ_k, σ_k²).

³If supp(q) ⊄ supp(p), then outside the support of p we have KL(q ∥ p) = E_q[log q] − E_q[log p] = ∞.

⁴Stan provides transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors [4].
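A quick numerical illustration of the "implicit non-Gaussian" point: pushing a mean-field Gaussian on ζ back through T⁻¹ = exp yields a skewed, positive-valued (log-normal) approximation on θ. The values μ = 0, s = 0.5 are arbitrary:

```python
import math, random

random.seed(0)
mu, s = 0.0, 0.5  # arbitrary mean-field parameters for one coordinate of zeta

# Draw from the Gaussian q(zeta) and map back through T^{-1} = exp.
zetas = [random.gauss(mu, s) for _ in range(200000)]
thetas = [math.exp(z) for z in zetas]

# The implied density on theta is log-normal: skewed, supported on (0, inf),
# with mean exp(mu + s^2 / 2) rather than exp(mu).
mean_theta = sum(thetas) / len(thetas)
print(abs(mean_theta - math.exp(mu + s * s / 2)) < 0.01)
```

Every sample lands inside supp(p(θ)) = R⁺ by construction, which is how the transformation enforces the support-matching constraint of Eq. (2).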

4 - Algorithm 1: Automatic differentiation variational inference (ADVI)

Input: Dataset $X = x_{1:N}$, model $p(X, \theta)$.
Set iteration counter $i = 0$ and choose a stepsize sequence $\rho^{(i)}$.
Initialize $\mu^{(0)} = 0$ and $\omega^{(0)} = 0$.
while change in ELBO is above some threshold do
    Draw $M$ samples $\eta_m \sim \mathcal{N}(0, I)$ from the standard multivariate Gaussian.
    Invert the standardization $\zeta_m = \mathrm{diag}(\exp(\omega^{(i)}))\,\eta_m + \mu^{(i)}$.
    Approximate $\nabla_{\mu}\mathcal{L}$ and $\nabla_{\omega}\mathcal{L}$ using MC integration (Eqs. (4) and (5)).
    Update $\mu^{(i+1)} \leftarrow \mu^{(i)} + \rho^{(i)}\nabla_{\mu}\mathcal{L}$ and $\omega^{(i+1)} \leftarrow \omega^{(i)} + \rho^{(i)}\nabla_{\omega}\mathcal{L}$.
    Increment iteration counter.
end
Return $\mu^* \leftarrow \mu^{(i)}$ and $\omega^* \leftarrow \omega^{(i)}$.
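The loop above can be sketched in a few lines of NumPy for the one-dimensional running example (Poisson likelihood, Weibull(1.5, 1) prior, a single observation $x = 3$ chosen for illustration; $K = 1$). This is a minimal sketch, not the Stan implementation: the fixed step size, sample count $M$, and iteration budget are arbitrary choices, and the gradients are Eqs. (4) and (5) specialized to this model.

```python
import numpy as np

def dlog_g(x, zeta):
    # d/dzeta [log p(x, exp(zeta)) + zeta] for Poisson(x | theta),
    # Weibull(theta; 1.5, 1): equals x - theta + 0.5 - 1.5*theta^1.5 + 1.
    theta = np.exp(zeta)
    return x - theta + 0.5 - 1.5 * theta**1.5 + 1.0

def advi_1d(x=3, M=20, steps=3000, rho=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mu, omega = 0.0, 0.0                       # initialize as in Algorithm 1
    for _ in range(steps):
        eta = rng.standard_normal(M)           # draw M standard-normal samples
        zeta = np.exp(omega) * eta + mu        # invert the standardization
        g = dlog_g(x, zeta)
        grad_mu = g.mean()                                     # Eq. (4)
        grad_omega = (g * eta * np.exp(omega)).mean() + 1.0    # Eq. (5)
        mu += rho * grad_mu                    # stochastic gradient ascent
        omega += rho * grad_omega
    return mu, np.exp(omega)

mu_star, sigma_star = advi_1d()
```

With a decreasing (or adaptive, as in Appendix E) step size the iterates converge; a constant step as here merely jitters around the optimum, which is enough for a demonstration.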

The standardization encapsulates the variational parameters and gives the fixed density

$$q(\eta\,;\,0, I) = \mathcal{N}(\eta\,;\,0, I) = \prod_{k=1}^{K} \mathcal{N}(\eta_k\,;\,0, 1).$$

The standardization transforms the variational problem from Eq. (3) into

$$\mu^*, \omega^* = \arg\max_{\mu,\omega} \mathcal{L}(\mu, \omega) = \arg\max_{\mu,\omega} \mathbb{E}_{\mathcal{N}(\eta\,;\,0,I)}\Big[\log p\big(X, T^{-1}(S^{-1}_{\mu,\omega}(\eta))\big) + \log\big|\det J_{T^{-1}}\big(S^{-1}_{\mu,\omega}(\eta)\big)\big|\Big] + \sum_{k=1}^{K} \omega_k,$$

where we drop constant terms from the calculation. This expectation is with respect to a standard Gaussian, and the parameters $\mu$ and $\omega$ are both unconstrained (Figure 3c). We push the gradient inside the expectations and apply the chain rule to get

$$\nabla_{\mu}\mathcal{L} = \mathbb{E}_{\mathcal{N}(\eta)}\Big[\nabla_{\theta}\log p(X, \theta)\,\nabla_{\zeta}T^{-1}(\zeta) + \nabla_{\zeta}\log\big|\det J_{T^{-1}}(\zeta)\big|\Big], \tag{4}$$

$$\nabla_{\omega_k}\mathcal{L} = \mathbb{E}_{\mathcal{N}(\eta_k)}\Big[\Big(\nabla_{\theta_k}\log p(X, \theta)\,\nabla_{\zeta_k}T^{-1}(\zeta) + \nabla_{\zeta_k}\log\big|\det J_{T^{-1}}(\zeta)\big|\Big)\,\eta_k \exp(\omega_k)\Big] + 1. \tag{5}$$

(The derivations are in Appendix B.)
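Because the same standard-normal draws define both the MC estimate of the objective and the gradient estimators, Eqs. (4) and (5) can be sanity-checked against finite differences of that MC estimate. A sketch for the one-dimensional running example (my own illustration; $x = 3$, a fixed seed, and the evaluation point are arbitrary):

```python
import numpy as np

def mc_objective(mu, omega, x=3, M=100_000, seed=42):
    # MC estimate of E[log g(x, zeta)] + omega (additive constants dropped)
    eta = np.random.default_rng(seed).standard_normal(M)
    zeta = np.exp(omega) * eta + mu
    log_g = x * zeta - np.exp(zeta) + 0.5 * zeta - np.exp(1.5 * zeta) + zeta
    return log_g.mean() + omega

def mc_grads(mu, omega, x=3, M=100_000, seed=42):
    # Eqs. (4) and (5) in one dimension, using the same draws as mc_objective
    eta = np.random.default_rng(seed).standard_normal(M)
    zeta = np.exp(omega) * eta + mu
    dlog_g = x - np.exp(zeta) + 0.5 - 1.5 * np.exp(1.5 * zeta) + 1.0
    return dlog_g.mean(), (dlog_g * eta * np.exp(omega)).mean() + 1.0

mu, omega = 0.2, -0.8
g_mu, g_omega = mc_grads(mu, omega)
h = 1e-5
fd_mu = (mc_objective(mu + h, omega) - mc_objective(mu - h, omega)) / (2 * h)
fd_omega = (mc_objective(mu, omega + h) - mc_objective(mu, omega - h)) / (2 * h)
```

Fixing the seed gives common random numbers, so the check is deterministic rather than drowned in MC noise.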

We can now compute the gradients inside the expectation with automatic differentiation. The only thing left is the expectation. MC integration provides a simple approximation: draw $M$ samples from the standard Gaussian and evaluate the empirical mean of the gradients within the expectation [20]. This gives unbiased noisy gradients of the ELBO for any differentiable probability model. We can now use these gradients in a stochastic optimization routine to automate variational inference.

2.6 Automatic Variational Inference

Equipped with unbiased noisy gradients of the ELBO, ADVI implements stochastic gradient ascent (Algorithm 1). We ensure convergence by choosing a decreasing step-size sequence. In practice, we use an adaptive sequence [21] with finite memory. (See Appendix E for details.)

ADVI has complexity $O(2NMK)$ per iteration, where $M$ is the number of MC samples (typically between 1 and 10). Coordinate ascent VI has complexity $O(2NK)$ per pass over the dataset. We scale ADVI to large datasets using stochastic optimization [3, 10]. The adjustment to Algorithm 1 is simple: sample a minibatch of size $B \ll N$ from the dataset and scale the likelihood of the sampled minibatch by $N/B$ [3]. The stochastic extension of ADVI has per-iteration complexity $O(2BMK)$.
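The $N/B$ rescaling is what keeps the stochastic gradient unbiased: averaged over random minibatches, the scaled minibatch log-likelihood matches the full-data log-likelihood. A small sketch with a synthetic Poisson dataset (dataset size, batch size, and rate are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.poisson(2.0, size=1000)
theta = 1.8

def loglik(x):
    # Poisson log-likelihood up to an additive constant in theta
    return np.sum(x * np.log(theta) - theta)

full = loglik(data)
N, B = len(data), 50
# Scale each minibatch log-likelihood by N/B; average over many minibatches
estimates = [N / B * loglik(rng.choice(data, size=B, replace=False))
             for _ in range(2000)]
```

Each individual estimate is noisy, but their mean concentrates on the full-data value, which is all stochastic gradient ascent needs.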

6 - Transformation approach

• latent var. -> real space -> standardized space
• Mean-Field Gaussian

[Figure 3: three density plots over $\theta$, $\zeta$, and $\eta$, linked by $T$, $S_{\mu,\omega}$ and their inverses $T^{-1}$, $S^{-1}_{\mu,\omega}$; each panel shows the prior, posterior, approximation, and density.]

(a) Latent variable space (b) Real coordinate space (c) Standardized space

Figure 3: Transformations for ADVI. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}^+$. (a→b) $T$ transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian. (b→c) $S_{\mu,\omega}$ absorbs the parameters of the Gaussian. (c) We maximize the ELBO in the standardized space, with a fixed standard Gaussian approximation.

The vector $\phi = (\mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K)$ contains the mean and standard deviation of each Gaussian factor. This defines our variational approximation in the real coordinate space. (Figure 3b.)

The transformation $T$ maps the support of the latent variables to the real coordinate space; its inverse $T^{-1}$ maps back to the support of the latent variables. This implicitly defines the variational approximation in the original latent variable space as $q(T(\theta)\,;\,\phi)\,\big|\det J_T(\theta)\big|$. The transformation ensures that the support of this approximation is always bounded by that of the true posterior in the original latent variable space (Figure 3a). Thus we can freely optimize the ELBO in the real coordinate space (Figure 3b) without worrying about the support matching constraint.
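For the running example, where $T = \log$, the induced approximation $q(T(\theta)\,;\,\phi)\,|\det J_T(\theta)|$ is exactly a log-normal density, which SciPy can confirm directly (the parameter values and evaluation point below are arbitrary):

```python
import numpy as np
from scipy.stats import norm, lognorm

mu, sigma = 0.3, 0.5   # variational parameters in the real coordinate space
theta = 1.1            # a point in the original (positive) latent space

# q(T(theta); phi) |det J_T(theta)| with T = log and J_T(theta) = 1/theta
induced = norm.pdf(np.log(theta), loc=mu, scale=sigma) / theta

# ...which is the log-normal density with the same parameters
reference = lognorm.pdf(theta, s=sigma, scale=np.exp(mu))
```

This illustrates the point in the text: the approximation is Gaussian only in the transformed space; in the original space it is non-Gaussian and supported on $\mathbb{R}^+$, matching the posterior.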

The ELBO in the real coordinate space is

$$\mathcal{L}(\mu, \sigma) = \mathbb{E}_{q(\zeta)}\Big[\log p\big(X, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big|\Big] + \frac{K}{2}\big(1 + \log(2\pi)\big) + \sum_{k=1}^{K} \log \sigma_k,$$

where we plug in the analytic form of the Gaussian entropy. (The derivation is in Appendix A.)

We choose a diagonal Gaussian for efficiency. This choice may call to mind the Laplace approxima-

tion technique, where a second-order Taylor expansion around the maximum-a-posteriori estimate

gives a Gaussian approximation to the posterior. However, using a Gaussian variational approxima-

tion is not equivalent to the Laplace approximation [18]. The Laplace approximation relies on max-

imizing the probability density; it fails with densities that have discontinuities on its boundary. The

Gaussian approximation considers probability mass; it does not suffer this degeneracy. Furthermore,

our approach is distinct in another way: because of the transformation, the posterior approximation

in the original latent variable space (Figure 3a) is non-Gaussian.

2.5 Automatic Differentiation for Stochastic Optimization

We now maximize the ELBO in real coordinate space,

$$\mu^*, \sigma^* = \arg\max_{\mu, \sigma} \mathcal{L}(\mu, \sigma) \quad \text{such that} \quad \sigma \succ 0. \tag{3}$$

We use gradient ascent to reach a local maximum of the ELBO. Unfortunately, we cannot apply automatic differentiation to the ELBO in this form. This is because the expectation defines an intractable integral that depends on $\mu$ and $\sigma$; we cannot directly represent it as a computer program. Moreover, the standard deviations in $\sigma$ must remain positive. Thus, we employ one final transformation: elliptical standardization⁵ [19], shown in Figures 3b and 3c.

First re-parameterize the Gaussian distribution with the log of the standard deviation, $\omega = \log(\sigma)$, applied element-wise. The support of $\omega$ is now the real coordinate space and $\sigma$ is always positive. Then define the standardization $\eta = S_{\mu,\omega}(\zeta) = \mathrm{diag}(\exp(\omega))^{-1}(\zeta - \mu)$.

⁵Also known as a "co-ordinate transformation" [7], an "invertible transformation" [10], and the "re-parameterization trick" [6].
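The effect of the standardization is easy to check by simulation: applying $S_{\mu,\omega}$ to draws from $q(\zeta)$ yields standard-normal draws, and its inverse maps them back exactly (the parameter values and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, omega = 1.5, np.log(0.7)   # omega = log(sigma)

# Draw zeta ~ q(zeta) = N(mu, exp(omega)^2), then standardize
zeta = mu + np.exp(omega) * rng.standard_normal(100_000)
eta = np.exp(-omega) * (zeta - mu)      # eta = S_{mu,omega}(zeta)
zeta_back = np.exp(omega) * eta + mu    # inverse standardization, as in Algorithm 1
```

The standardized samples no longer depend on $\mu$ or $\omega$, which is what lets the expectation be taken with respect to a fixed standard Gaussian.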

5 - The minimization problem from Eq. (1) becomes

inference Car - The minimization problem from Eq. (1) becomes

⇤ D arg max L. / such that supp.q.✓ I // ✓ supp.p.✓ j X//:

(2)

We explicitly specify the support-matching constraint implied in the

divergence.3 We highlight

this constraint, as we do not specify the form of the variational approximation; thus we must ensure

that q.✓ I / stays within the support of the posterior, which is defined by the support of the prior.

Why is

difficult to automate? In classical variational inference, we typically design a condition-

ally conjugate model. Then the optimal approximating family matches the prior. This satisfies the

support constraint by definition [16]. When we want to approximate models that are not condition-

ally conjugate, we carefully study the model and design custom approximations. These depend on

the model and on the choice of the approximating density.

One way to automate

is to use black-box variational inference [8, 9]. If we select a density whose

support matches the posterior, then we can directly maximize the

using Monte Carlo ( )

integration and stochastic optimization. Another strategy is to restrict the class of models and use a

fixed variational approximation [10]. For instance, we may use a Gaussian density for inference in

unrestrained differentiable probability models, i.e. where supp.p.✓// D RK.

We adopt a transformation-based approach. First we automatically transform the support of the latent variables in our model to the real coordinate space. Then we posit a Gaussian variational density. The transformation induces a non-Gaussian approximation in the original variable space and guarantees that it stays within the support of the posterior. Here is how it works.

2.3 Automatic Transformation of Constrained Variables

Begin by transforming the support of the latent variables $\theta$ such that they live in the real coordinate space $\mathbb{R}^K$. Define a one-to-one differentiable function $T : \mathrm{supp}(p(\theta)) \to \mathbb{R}^K$ and identify the transformed variables as $\zeta = T(\theta)$. The transformed joint density $g(X, \zeta)$ is

$$g(X, \zeta) = p\big(X, T^{-1}(\zeta)\big)\,\big|\det J_{T^{-1}}(\zeta)\big|,$$

where $p$ is the joint density in the original latent variable space, and $J_{T^{-1}}$ is the Jacobian of the inverse of $T$. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes [17]. (See Appendix D.)

Consider again our running example. The rate $\theta$ lives in $\mathbb{R}^+$. The logarithm $\zeta = T(\theta) = \log(\theta)$ transforms $\mathbb{R}^+$ to the real line $\mathbb{R}$. Its Jacobian adjustment is the derivative of the inverse of the logarithm, $|\det J_{T^{-1}}(\zeta)| = \exp(\zeta)$. The transformed density is

$$g(x, \zeta) = \text{Poisson}(x \mid \exp(\zeta))\,\text{Weibull}(\exp(\zeta)\,;\,1.5, 1)\,\exp(\zeta).$$

Figures 3a and 3b depict this transformation.

MF Gaussian v approx

• mean-field Gaussian variational approx
• param vector contains mean/std. deviation
• transformation $T$ ensures the support of approx is always within original latent var.'s

S ;!

Prior

Posterior

1

1

1

Approximation

Density

T 1

S 1

;!

0

1

2

3

✓

1

0

1

2 ⇣

2

1 0

1

2

⌘

(a) Latent variable space

(b) Real coordinate space

(c) Standardized space

Figure 3: Transformations for

. The purple line is the posterior. The green line is the approxi-

mation. (a) The latent variable space is RC. (a!b) T transforms the latent variable space to R. (b)

The variational approximation is a Gaussian. (b!c) S ; absorbs the parameters of the Gaussian.

!

(c) We maximize the

in the standardized space, with a fixed standard Gaussian approximation.

The vector D . 1;

; K; 1;

; K/ contains the mean and standard deviation of each Gaus-

sian factor. This defines our variational approximation in the real coordinate space. (Figure 3b.)

MF Gaussian v appr

The transformation T maps the support of the latent variables to

ˇˇ

ˇ

det o

the

J

ˇ

T . x

real coordinate space; its inverse

T 1 maps back to the support of the latent variables. This implicitly defines the variational approx-

imation in the original latent variable space as q.T .✓/ I /

✓/ : The transformation ensures

that the support of this approximation is always bounded by that of the true posterior in the original

latent variable space (Figure 3a). Thus we can freely optimize the

in the real coordinate space

(Figure 3b) without worrying about the support matching constraint.

Gaussian Entropy

• ELBO of

The

in r

the e

real al space

coordinate space is (Appendix A)

ˇ

ˇ

K

K

X

L. ; / D E

log

ˇ det

ˇ

log

q.⇣/

p X; T 1.⇣/ C log

JT 1.⇣/

C

.1 C log.2⇡// C

k ;

2

kD1

• MF Gaussian v appr

where we plug in the analytic form of the ox: for ef

Gaussian entropy. ficiency

(The derivation is in Appendix A.)

We choose a diagonal Gaussian for efficiency. This choice may call to mind the Laplace approxima-

tion technique, where a second-order Taylor expansion around the maximum-a-posteriori estimate

Monte Carlo Integration

gives a Gaussian approximation to the posterior. However, using a Gaussian variational approxima-

tion is not equivalent to the Laplace approximation [18]. The Laplace approximation relies on max-

imizing the probability density; it fails with densities that have discontinuities on its boundary. The

• original la

Gaussian appro

tent var

ximation considers . space is not Gaussian

probability mass; it does not suffer this degeneracy. Furthermore,

our approach is distinct in another way: because of the transformation, the posterior approximation

in the original latent variable space (Figure 3a) is non-Gaussian. T

S ;!

Prior

2.5 Automatic Differentiation for Stochastic Optimization

Posterior

1

1

1

Approximation

We now maximize the

in real coordinate

Density

space,

T 1

S 1

;!

⇤; ⇤ D arg max L. ; / such that

0:

(3)

;

0

1

2

3

✓

1

0

1

2 ⇣

2

1 0

1

2

⌘

We use gradient ascent to reach a local maximum

(a)

of

Latent v the

ariable

. U

space nfortunately

(b) , w

R e

eal cannot appl

coordinate y auto-

space

(c) Standardized space

matic differentiation to the

in this form. This is because the expectation defines an intractable

integral that depends on

and ; we cannot

Figure 3: directl

Transf y

or represent

mations f it

or as a computer

. The purprog

ple ram.

line is More-

the posterior. The green line is the approxi-

18

over, the standard deviations in must remain positive. Thus, we employ one final transformation:

mation. (a) The latent variable space is RC. (a!b) T transforms the latent variable space to R. (b)

elliptical standardization5 [19], shown in Figures 3b and 3c.

The variational approximation is a Gaussian. (b!c) S ; absorbs the parameters of the Gaussian.

!

First re-parameterize the Gaussian dis

(c)tribution

We

with the

maximize

log

the

of the

in standard

the s

deviation,

tandardized sp ! D

ace, log

with.a /,fixed standard Gaussian approximation.

applied element-wise. The support of ! is now the real coordinate space and is always positive.

Then define the standardization ⌘ D S ;!.⇣/ D diag exp .!/ 1 .⇣

/. The standardization

The vector D . 1;

; K; 1;

; K/ contains the mean and standard deviation of each Gaus-

5Also known as a “co-ordinate transfor

sian mation

factor. ” [7],

This an “inv

definesertible

our v transf

ar

ormation

iational

” [10],

appro

and

ximation the

in “re-

the real coordinate space. (Figure 3b.)

parameterization trick” [6].

The transformation T maps the support of the latent variables to the real coordinate space; its inverse

T 1 maps back to the support of the latent variables. This implicitly defines the variational approx-

ˇ

ˇ

imation in 5

the original latent variable space as q.T .✓/ I /ˇ det J

ˇ

T .✓ / : The transformation ensures

that the support of this approximation is always bounded by that of the true posterior in the original

latent variable space (Figure 3a). Thus we can freely optimize the

in the real coordinate space

(Figure 3b) without worrying about the support matching constraint.

The

in the real coordinate space is

ˇ

ˇ

K

K

X

L. ; / D E

log

ˇ det

ˇ

log

q.⇣/

p X; T 1.⇣/ C log

JT 1.⇣/

C

.1 C log.2⇡// C

k ;

2

kD1

where we plug in the analytic form of the Gaussian entropy. (The derivation is in Appendix A.)

We choose a diagonal Gaussian for efficiency. This choice may call to mind the Laplace approxima-

tion technique, where a second-order Taylor expansion around the maximum-a-posteriori estimate

gives a Gaussian approximation to the posterior. However, using a Gaussian variational approxima-

tion is not equivalent to the Laplace approximation [18]. The Laplace approximation relies on max-

imizing the probability density; it fails with densities that have discontinuities on its boundary. The

Gaussian approximation considers probability mass; it does not suffer this degeneracy. Furthermore,

our approach is distinct in another way: because of the transformation, the posterior approximation

in the original latent variable space (Figure 3a) is non-Gaussian.

2.5 Automatic Differentiation for Stochastic Optimization

We now maximize the ELBO in the real coordinate space,

$$\mu^{*},\,\sigma^{2*} \;=\; \operatorname*{arg\,max}_{\mu,\,\sigma^{2}}\,\mathcal{L}(\mu,\sigma^{2}) \quad\text{such that}\quad \sigma^{2}\succ 0. \tag{3}$$

We use gradient ascent to reach a local maximum of the ELBO. Unfortunately, we cannot apply automatic differentiation to the ELBO in this form. This is because the expectation defines an intractable integral that depends on µ and σ²; we cannot directly represent it as a computer program. Moreover, the standard deviations in σ must remain positive. Thus, we employ one final transformation: elliptical standardization⁵ [19], shown in Figures 3b and 3c.

First re-parameterize the Gaussian distribution with the log of the standard deviation, ω = log(σ), applied element-wise. The support of ω is now the real coordinate space and σ is always positive. Then define the standardization η = S_{µ,ω}(ζ) = diag(exp(ω))⁻¹ (ζ − µ).

⁵Also known as a “co-ordinate transformation” [7], an “invertible transformation” [10], and the “re-parameterization trick” [6].
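The standardization and its inverse can be written out directly; a minimal numpy sketch (the µ and ω values here are illustrative, not from the paper):

```python
import numpy as np

# Elliptical standardization for a diagonal Gaussian:
#   eta = S_{mu,omega}(zeta) = diag(exp(omega))^{-1} (zeta - mu)
# with omega = log(sigma), so sigma = exp(omega) is always positive.
rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])
omega = np.log(np.array([0.3, 1.0, 2.5]))   # omega = log(sigma), unconstrained

def standardize(zeta, mu, omega):
    return (zeta - mu) / np.exp(omega)

def unstandardize(eta, mu, omega):
    # Inverse used in Algorithm 1: zeta = diag(exp(omega)) eta + mu
    return np.exp(omega) * eta + mu

# A draw from N(mu, diag(exp(omega)^2)) standardizes to a draw from N(0, I),
# and the round trip recovers the original point.
zeta = mu + np.exp(omega) * rng.standard_normal(3)
eta = standardize(zeta, mu, omega)
assert np.allclose(unstandardize(eta, mu, omega), zeta)
```

Working in (µ, ω) rather than (µ, σ) is what removes the positivity constraint from the optimization.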

5 - 2.4 Implicit Non-Gaussian Variational Approximation

After the transformation, the latent variables ζ have support on R^K. We posit a diagonal (mean-field) Gaussian variational approximation

$$q(\zeta;\,\phi) \;=\; \mathcal{N}(\zeta;\,\mu,\sigma^{2}) \;=\; \prod_{k=1}^{K}\mathcal{N}(\zeta_{k};\,\mu_{k},\sigma_{k}^{2}),$$

where the vector φ = (µ₁, …, µ_K, σ₁², …, σ_K²) concatenates the mean and variance of each Gaussian factor. This defines our variational approximation in the real coordinate space. (Figure 3b.)

Figure 3: Transformations for ADVI. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is R⁺. (a→b) T transforms the latent variable space to R. (b) The variational approximation is a Gaussian. (b→c) S_{µ,ω} absorbs the parameters of the Gaussian. (c) We maximize the ELBO in the standardized space, with a fixed standard Gaussian approximation.

The transformation T from Equation 3 maps the support of the latent variables to the real coordinate space; its inverse T⁻¹ maps back to the support of the latent variables. This implicitly defines the variational approximation in the original latent variable space as N(T⁻¹(ζ); µ, σ²) |det J_{T⁻¹}(ζ)|. The transformation ensures that the support of this approximation is always bounded by that of the true posterior in the original latent variable space (Figure 3a). Thus we can freely optimize the ELBO in the real coordinate space (Figure 3b) without worrying about the support matching constraint.
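The implicit non-Gaussian construction can be made concrete with a single positive latent variable and T(θ) = log θ; this example is ours, not from the slides. The induced density in the original space is lognormal, hence non-Gaussian:

```python
import numpy as np
from scipy.stats import norm, lognorm

# Sketch: with one positive latent variable theta and zeta = T(theta) = log(theta),
# a Gaussian q on zeta implicitly defines a *non-Gaussian* density on theta via
# the change of variables:
#   q(theta) = N(log(theta); mu, sigma^2) * |d log(theta) / d theta|
mu, sigma = 0.4, 0.7
theta = np.linspace(0.1, 5.0, 50)

implied = norm.pdf(np.log(theta), loc=mu, scale=sigma) / theta  # Jacobian is 1/theta

# For this particular T the implied density is exactly lognormal.
assert np.allclose(implied, lognorm.pdf(theta, s=sigma, scale=np.exp(mu)))
```

The same change-of-variables factor |det J_{T⁻¹}(ζ)| is what appears inside the ELBO.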

Algorithm 1: Automatic Differentiation Variational Inference (ADVI)

Input: Dataset X = x_{1:N}, model p(X, θ).
Set iteration counter i = 0 and choose a stepsize sequence ρ(i).
Initialize µ(0) = 0 and ω(0) = 0.
while change in ELBO is above some threshold do
    Draw M samples η_m ∼ N(0, I) from the standard multivariate Gaussian.
    Invert the standardization ζ_m = diag(exp(ω(i))) η_m + µ(i).
    Approximate ∇_µL and ∇_ωL using MC integration (Equations 5 and 6).
    Update µ(i+1) ← µ(i) + ρ(i) ∇_µL and ω(i+1) ← ω(i) + ρ(i) ∇_ωL.
    Increment iteration counter.
end
Return µ* ← µ(i) and ω* ← ω(i).

Gradient of ELBO

• maximize ELBO
• expectation is in terms of the standard Gaussian
• gradient of ELBO (Appendix B)

The standardization transforms the variational problem from Equation 3 into

$$\mu^{*},\,\omega^{*} \;=\; \operatorname*{arg\,max}_{\mu,\omega}\,\mathcal{L}(\mu,\omega) \;=\; \operatorname*{arg\,max}_{\mu,\omega}\,\mathbb{E}_{\mathcal{N}(\eta;\,0,I)}\!\left[\log p\!\left(X,\,T^{-1}\!\big(S^{-1}_{\mu,\omega}(\eta)\big)\right) + \log\left|\det J_{T^{-1}}\!\big(S^{-1}_{\mu,\omega}(\eta)\big)\right|\right] + \sum_{k=1}^{K}\omega_{k}, \tag{4}$$

where we drop terms independent of µ and ω from the calculation. The expectation is now in terms of the standard Gaussian, and both parameters µ and ω are unconstrained. (Figure 3c.) We push the gradient inside the expectations and apply the chain rule to get

$$\nabla_{\mu}\mathcal{L} \;=\; \mathbb{E}_{\mathcal{N}(\eta)}\!\left[\nabla_{\theta}\log p(X,\theta)\,\nabla_{\zeta}T^{-1}(\zeta) + \nabla_{\zeta}\log\left|\det J_{T^{-1}}(\zeta)\right|\right], \tag{5}$$

$$\nabla_{\omega_{k}}\mathcal{L} \;=\; \mathbb{E}_{\mathcal{N}(\eta_{k})}\!\left[\left(\nabla_{\theta_{k}}\log p(X,\theta)\,\nabla_{\zeta_{k}}T^{-1}(\zeta) + \nabla_{\zeta_{k}}\log\left|\det J_{T^{-1}}(\zeta)\right|\right)\eta_{k}\exp(\omega_{k})\right] + 1. \tag{6}$$

(Derivations in Appendix B.)

We can now compute the gradients inside the expectation with automatic differentiation. This leaves only the expectation. MC integration provides a simple approximation: draw M samples from the standard Gaussian and evaluate the empirical mean of the gradients within the expectation [20]. This gives unbiased noisy gradients of the ELBO for any differentiable probability model. We can now use these gradients in a stochastic optimization routine to automate variational inference.

2.6 Scalable Automatic Variational Inference

Equipped with unbiased noisy gradients of the ELBO, ADVI implements stochastic gradient ascent (Algorithm 1). We ensure convergence by choosing a decreasing step-size schedule. In practice, we use an adaptive schedule [21] with finite memory. (See Appendix E for details.)

ADVI has complexity O(2NMK) per iteration, where M is the number of MC samples (typically between 1 and 10). Coordinate ascent VI has complexity O(2NK) per pass over the dataset. We scale ADVI to large datasets using stochastic optimization [3, 10]. The adjustment to Algorithm 1 is simple: sample a minibatch of size B ≪ N from the dataset and scale the likelihood of the model by N/B [3]. The stochastic extension of ADVI has per-iteration complexity O(2BMK).
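Algorithm 1 can be sketched end to end on a toy model. The following is a hand-rolled illustration, not Stan's implementation: the model (an Exponential likelihood with a Gamma prior on the rate), the fixed stepsize, and the hand-coded gradient of the log joint are all assumptions made for the sake of the example; Stan would obtain the gradients by automatic differentiation.

```python
import numpy as np

# Toy ADVI: data x_i ~ Exponential(theta), prior theta ~ Gamma(a, b), theta > 0.
# Transformation T(theta) = log(theta), so T^{-1}(zeta) = exp(zeta) and
# log |det J_{T^{-1}}(zeta)| = zeta (its gradient in zeta is 1).
rng = np.random.default_rng(1)
a, b = 2.0, 1.0
x = rng.exponential(scale=0.5, size=100)      # true rate = 2
N, S = x.size, x.sum()

def grad_theta_log_joint(theta):
    # d/d theta [ N log theta - theta S + (a-1) log theta - b theta ]
    return (N + a - 1.0) / theta - (S + b)

mu, omega = 0.0, 0.0                          # initialize mu(0) = omega(0) = 0
M, rho = 10, 0.005                            # MC samples and a small fixed stepsize
for i in range(5000):
    eta = rng.standard_normal(M)
    zeta = np.exp(omega) * eta + mu           # invert the standardization
    theta = np.exp(zeta)                      # T^{-1}
    inner = grad_theta_log_joint(theta) * theta + 1.0   # chain rule + Jacobian term
    grad_mu = inner.mean()                               # Equation 5
    grad_omega = (inner * eta * np.exp(omega)).mean() + 1.0   # Equation 6
    mu += rho * grad_mu                       # stochastic gradient ascent
    omega += rho * grad_omega

# The exact posterior is Gamma(a + N, b + S); the mean of the fitted lognormal
# approximation, exp(mu + sigma^2 / 2), should land close to the posterior mean.
post_mean = (a + N) / (b + S)
approx_mean = np.exp(mu + 0.5 * np.exp(2 * omega))
assert abs(approx_mean - post_mean) / post_mean < 0.1
```

A fixed stepsize stands in here for the adaptive schedule [21] the slides mention.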

7 - Algorithm 1: Automatic differentiation variational inference (ADVI)

The standardization encapsulates the variational parameters and gives the fixed density

$$q(\eta;\,0,I) \;=\; \mathcal{N}(\eta;\,0,I) \;=\; \prod_{k=1}^{K}\mathcal{N}(\eta_{k};\,0,1).$$

6 - Implementation…

The mean-field Gaussian family used by ADVI is implemented in Stan:
https://github.com/stan-dev/stan/blob/develop/src/stan/variational/families/normal_meanfield.hpp#L400

6 - Execution on Stan

• cmdstan
• rstan

(Diagram: Model + Big Data → ADVI.)

23 - Lin. Regression /w ARD

Figure 6: Stan code for Linear Regression with Automatic Relevance Determination.

data {
  int<lower=0> N;                  // number of data items
  int<lower=0> D;                  // dimension of input features
  matrix[N,D] x;                   // input matrix
  vector[N] y;                     // output vector
  // hyperparameters for Gamma priors
  real<lower=0> a0;
  real<lower=0> b0;
  real<lower=0> c0;
  real<lower=0> d0;
}
parameters {
  vector[D] w;                     // weights (coefficients) vector
  real<lower=0> sigma2;            // variance
  vector<lower=0>[D] alpha;        // hyper-parameters on weights
}
transformed parameters {
  real sigma;                      // standard deviation
  vector[D] one_over_sqrt_alpha;   // numerical stability
  sigma <- sqrt(sigma2);
  for (i in 1:D) {
    one_over_sqrt_alpha[i] <- 1 / sqrt(alpha[i]);
  }
}
model {
  // alpha: hyper-prior on weights
  alpha ~ gamma(c0, d0);
  // sigma2: prior on variance
  sigma2 ~ inv_gamma(a0, b0);
  // w: prior on weights
  w ~ normal(0, sigma * one_over_sqrt_alpha);
  // y: likelihood
  y ~ normal(x * w, sigma);
}

Figure 4: Hierarchical generalized linear models. Comparison of ADVI to MCMC: held-out predictive likelihood as a function of wall time. (a) Linear Regression with ARD. (b) Hierarchical Logistic Regression. (Plots compare ADVI (M=1), ADVI (M=10), NUTS, and HMC; axes are average log predictive vs. seconds.)

3 Empirical Study

We now study ADVI across a variety of models. We compare its speed and accuracy to two Markov chain Monte Carlo (MCMC) sampling algorithms: Hamiltonian Monte Carlo (HMC) [22] and the no-U-turn sampler (NUTS)⁶ [5]. We assess ADVI convergence by tracking the ELBO. To place ADVI and MCMC on a common scale, we report predictive likelihood on held-out data as a function of time. We approximate the posterior predictive likelihood using a Monte Carlo estimate. For MCMC, we plug in posterior samples. For ADVI, we draw samples from the posterior approximation during the optimization. We initialize ADVI with a draw from a standard Gaussian.

We explore two hierarchical regression models, two matrix factorization models, and a mixture model. All of these models have nonconjugate prior structures. We conclude by analyzing a dataset of 250 000 images, where we report results across a range of minibatch sizes B.

3.1 A Comparison to Sampling: Hierarchical Regression Models

We begin with two nonconjugate regression models: linear regression with automatic relevance determination (ARD) [16] and hierarchical logistic regression [23].

Linear Regression with ARD. This is a sparse linear regression model with a hierarchical prior structure. (Details in Appendix F.) We simulate a dataset with 250 regressors such that half of the regressors have no predictive power. We use 10 000 training samples and hold out 1000 for testing.

Logistic Regression with Spatial Hierarchical Prior. This is a hierarchical logistic regression model from political science. The prior captures dependencies, such as states and regions, in a polling dataset from the United States 1988 presidential election [23]. (Details in Appendix G.) We train using 10 000 data points and withhold 1536 for evaluation. The regressors contain age, education, state, and region indicators. The dimension of the regression problem is 145.

Results. Figure 4 plots average log predictive accuracy as a function of time. For these simple models, all methods reach the same predictive accuracy. We study ADVI with two settings of M, the number of MC samples used to estimate gradients. A single sample per iteration is sufficient; it is also the fastest. (We set M = 1 from here on.)

3.2 Exploring Nonconjugacy: Matrix Factorization Models

We continue by exploring two nonconjugate non-negative matrix factorization models: a constrained Gamma Poisson model [24] and a Dirichlet Exponential model. Here, we show how easy it is to explore new models using ADVI. In both models, we use the Frey Face dataset, which contains 1956 frames (28 × 20 pixels) of facial expressions extracted from a video sequence.

Constrained Gamma Poisson. This is a Gamma Poisson factorization model with an ordering constraint: each row of the Gamma matrix goes from small to large values. (Details in Appendix H.)

⁶NUTS is an adaptive extension of HMC. It is the default sampler in Stan.
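The ARD simulation described in Section 3.1 can be sketched as follows; the coefficient distribution and noise scale are illustrative assumptions, since the text only specifies the dimensions and that half of the 250 regressors carry no predictive power:

```python
import numpy as np

# Sketch of the Section 3.1 simulation (assumed details: standard-normal
# coefficients and unit noise are our choices, not from the slides).
rng = np.random.default_rng(42)
N_train, N_test, D = 10_000, 1_000, 250

w = rng.normal(size=D)
w[rng.permutation(D)[: D // 2]] = 0.0          # half the regressors are irrelevant

X = rng.normal(size=(N_train + N_test, D))
y = X @ w + rng.normal(scale=1.0, size=N_train + N_test)

X_train, y_train = X[:N_train], y[:N_train]    # 10 000 training samples
X_test, y_test = X[N_train:], y[N_train:]      # hold out 1000 for testing

assert X_train.shape == (10_000, 250) and y_test.shape == (1_000,)
```

The ARD prior is then expected to drive the α hyper-parameters of the irrelevant regressors toward large values, shrinking their weights.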

7 - Hierarchical Logistic Reg

Figure 7: Stan code for Hierarchical Logistic Regression, from [4].

data {
  int<lower=0> N;
  int<lower=0> n_age;
  int<lower=0> n_age_edu;
  int<lower=0> n_edu;
  int<lower=0> n_region_full;
  int<lower=0> n_state;
  int<lower=0,upper=n_age> age[N];
  int<lower=0,upper=n_age_edu> age_edu[N];
  vector<lower=0,upper=1>[N] black;
  int<lower=0,upper=n_edu> edu[N];
  vector<lower=0,upper=1>[N] female;
  int<lower=0,upper=n_region_full> region_full[N];
  int<lower=0,upper=n_state> state[N];
  vector[N] v_prev_full;
  int<lower=0,upper=1> y[N];
}
parameters {
  vector[n_age] a;
  vector[n_edu] b;
  vector[n_age_edu] c;
  vector[n_state] d;
  vector[n_region_full] e;
  vector[5] beta;
  real<lower=0,upper=100> sigma_a;
  real<lower=0,upper=100> sigma_b;
  real<lower=0,upper=100> sigma_c;
  real<lower=0,upper=100> sigma_d;
  real<lower=0,upper=100> sigma_e;
}
transformed parameters {
  vector[N] y_hat;
  for (i in 1:N)
    y_hat[i] <- beta[1]
      + beta[2] * black[i]
      + beta[3] * female[i]
      + beta[5] * female[i] * black[i]
      + beta[4] * v_prev_full[i]
      + a[age[i]]
      + b[edu[i]]
      + c[age_edu[i]]
      + d[state[i]]
      + e[region_full[i]];
}
model {
  a ~ normal(0, sigma_a);
  b ~ normal(0, sigma_b);
  c ~ normal(0, sigma_c);
  d ~ normal(0, sigma_d);
  e ~ normal(0, sigma_e);
  beta ~ normal(0, 100);
  y ~ bernoulli_logit(y_hat);
}

7 - Gamma Poisson Non-Neg

[Slide shows Figure 5 and the Stan code of Figure 8.]

Figure 5: Non-negative matrix factorization of the Frey Faces dataset. Comparison of ADVI to NUTS: held-out predictive likelihood as a function of wall time. (a) Gamma Poisson predictive likelihood. (b) Dirichlet Exponential predictive likelihood. (c) Gamma Poisson factors. (d) Dirichlet Exponential factors.

Figure 8: Stan code for the Gamma Poisson non-negative matrix factorization model.

Dirichlet Exponential. This is a nonconjugate Dirichlet Exponential factorization model with a Poisson likelihood. (Details in Appendix I.)

Results. Figure 5 shows average log predictive accuracy as well as ten factors recovered from both models. ADVI provides an order of magnitude speed improvement over NUTS (Figure 5a). NUTS struggles with the Dirichlet Exponential model (Figure 5b). In both cases, HMC does not produce any useful samples within a budget of one hour; we omit HMC from the plots.

3.3 Scaling to Large Datasets: Gaussian Mixture Model

We conclude with the Gaussian mixture model (GMM) example we highlighted earlier. This is a nonconjugate GMM applied to color image histograms. We place a Dirichlet prior on the mixture proportions, a Gaussian prior on the component means, and a lognormal prior on the standard deviations. (Details in Appendix J.) We explore the imageCLEF dataset, which has 250 000 images [25]. We withhold 10 000 images for evaluation.

In Figure 1a we randomly select 1000 images and train a model with 10 mixture components. NUTS struggles to find an adequate solution and HMC fails altogether. This is likely due to label switching, which can affect HMC-based techniques in mixture models [4].

Figure 1b shows ADVI results on the full dataset. Here we use ADVI with stochastic subsampling of minibatches from the dataset [3]. We increase the number of mixture components to 30. With a minibatch size of 500 or larger, ADVI reaches high predictive accuracy. Smaller minibatch sizes lead to suboptimal solutions, an effect also observed in [3]. ADVI converges in about two hours.

4 Conclusion

We develop automatic differentiation variational inference (ADVI) in Stan. ADVI leverages automatic transformations, an implicit non-Gaussian variational approximation, and automatic differentiation. This is a valuable tool. We can explore many models and analyze large datasets with ease. We emphasize that ADVI is currently available as part of Stan; it is ready for anyone to use.

Figure 9: Stan code for the Dirichlet Exponential non-negative matrix factorization model.

Acknowledgments

We thank Dustin Tran, Bruno Jacobs, and the reviewers for their comments. This work is supported by NSF IIS-0745520, IIS-1247664, IIS-1009542, SES-1424962, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, N66001-15-C-4032, Sloan G-2015-13987, IES DE R305D140059, NDSEG, Facebook, Adobe, Amazon, and the Siebel Scholar and John Templeton Foundations.

data {
  int<lower=0> U;
  int<lower=0> I;
  int<lower=0> K;
  int<lower=0> y[U, I];
  real<lower=0> a;
  real<lower=0> b;
  real<lower=0> c;
  real<lower=0> d;
}
parameters {
  positive_ordered[K] theta[U];   // user preference
  vector<lower=0>[K] beta[I];     // item attributes
}
model {
  for (u in 1:U)
    theta[u] ~ gamma(a, b);       // componentwise gamma
  for (i in 1:I)
    beta[i] ~ gamma(c, d);        // componentwise gamma
  for (u in 1:U) {
    for (i in 1:I) {
      increment_log_prob(
        poisson_log(y[u, i], theta[u]' * beta[i]));
    }
  }
}

Figure 8: Stan code for the Gamma Poisson non-negative matrix factorization model.
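As a cross-check, the generative process and the log-likelihood that the model block above accumulates can be sketched in NumPy (a sketch, not from the paper; the sizes U, I, K and the shape parameters are made up for illustration):

```python
import math
import numpy as np

def poisson_logpmf(y, rate):
    """log Poisson(y | rate) = y*log(rate) - rate - log(y!)."""
    return y * math.log(rate) - rate - math.lgamma(y + 1)

rng = np.random.default_rng(0)
U, I, K = 3, 4, 2
# positive_ordered: each user's K preferences sorted small -> large,
# mirroring positive_ordered[K] theta[U] in the Stan code
theta = np.sort(rng.gamma(2.0, 1.0, size=(U, K)), axis=1)
beta = rng.gamma(2.0, 1.0, size=(I, K))   # item attributes
y = rng.poisson(theta @ beta.T)           # observed counts

# log-likelihood accumulated exactly as in the Stan model block:
# Poisson rate for cell (u, i) is the inner product theta[u]' * beta[i]
log_lik = sum(poisson_logpmf(y[u, i], theta[u] @ beta[i])
              for u in range(U) for i in range(I))
print(log_lik < 0)   # a log-probability of counts, so negative
```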

Dirichlet Exponential NonNeg

data {
  int<lower=0> U;
  int<lower=0> I;
  int<lower=0> K;
  int<lower=0> y[U, I];
  real<lower=0> lambda0;
  real<lower=0> alpha0;
}
transformed data {
  vector<lower=0>[K] alpha0_vec;
  for (k in 1:K) {
    alpha0_vec[k] <- alpha0;
  }
}
parameters {
  simplex[K] theta[U];            // user preference
  vector<lower=0>[K] beta[I];     // item attributes
}
model {
  for (u in 1:U)
    theta[u] ~ dirichlet(alpha0_vec);   // componentwise dirichlet
  for (i in 1:I)
    beta[i] ~ exponential(lambda0);     // componentwise exponential
  for (u in 1:U) {
    for (i in 1:I) {
      increment_log_prob(
        poisson_log(y[u, i], theta[u]' * beta[i]));
    }
  }
}

[Slide shows Figure 5: Non-negative matrix factorization of the Frey Faces dataset. Comparison of ADVI to NUTS: held-out predictive likelihood as a function of wall time.]

Figure 9: Stan code for the Dirichlet Exponential non-negative matrix factorization model.
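The priors in this model can be sketched in NumPy as well (a sketch, not from the paper; sizes and hyperparameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
U, I, K = 3, 4, 2
lambda0, alpha0 = 1.0, 1.0

# theta[u] lives on the simplex, beta[i] is positive: the Figure 9 priors
theta = rng.dirichlet(np.full(K, alpha0), size=U)    # user preference
beta = rng.exponential(1.0 / lambda0, size=(I, K))   # item attributes
y = rng.poisson(theta @ beta.T)                      # Poisson likelihood

print(np.allclose(theta.sum(axis=1), 1.0))  # each row sums to one (simplex)
print((beta > 0).all())                      # exponential draws are positive
```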


GMM

data {
  int<lower=0> N;  // number of data points in entire dataset
  int<lower=0> K;  // number of mixture components
  int<lower=0> D;  // dimension
  vector[D] y[N];  // observations
  real<lower=0> alpha0;        // dirichlet prior
  real<lower=0> mu_sigma0;     // means prior
  real<lower=0> sigma_sigma0;  // variances prior
}
transformed data {
  vector<lower=0>[K] alpha0_vec;
  for (k in 1:K) {
    alpha0_vec[k] <- alpha0;
  }
}
parameters {
  simplex[K] theta;             // mixing proportions
  vector[D] mu[K];              // locations of mixture components
  vector<lower=0>[D] sigma[K];  // standard deviations of mixture components
}
model {
  // priors
  theta ~ dirichlet(alpha0_vec);
  for (k in 1:K) {
    mu[k] ~ normal(0.0, mu_sigma0);
    sigma[k] ~ lognormal(0.0, sigma_sigma0);
  }
  // likelihood
  for (n in 1:N) {
    real ps[K];
    for (k in 1:K) {
      ps[k] <- log(theta[k]) + normal_log(y[n], mu[k], sigma[k]);
    }
    increment_log_prob(log_sum_exp(ps));
  }
}

Figure 10: ADVI Stan code for the GMM example.

Figure 1: Held-out predictive accuracy results | Gaussian mixture model (GMM) of the imageCLEF image histogram dataset. (a) Subset of 1000 images: ADVI outperforms the no-U-turn sampler (NUTS), the default sampling method in Stan [5]. (b) Full dataset of 250 000 images: ADVI scales to large datasets by subsampling minibatches of size B from the dataset at each iteration [3]. We present more details in Section 3.3 and Appendix J.

Figure 1 illustrates the advantages of our method. Consider a nonconjugate Gaussian mixture model for analyzing natural images; this is 40 lines in Stan (Figure 10). Figure 1a illustrates Bayesian inference on 1000 images. The y-axis is held-out likelihood, a measure of model fitness; the x-axis is time on a log scale. ADVI is orders of magnitude faster than NUTS, a state-of-the-art MCMC algorithm (and Stan's default inference technique) [5]. We also study nonconjugate factorization models and hierarchical generalized linear models in Section 3.

Figure 1b illustrates Bayesian inference on 250 000 images, the size of data we more commonly find in machine learning. Here we use ADVI with stochastic variational inference [3], giving an approximate posterior in under two hours. For data like these, MCMC techniques cannot complete the analysis.

Related work. ADVI automates variational inference within the Stan probabilistic programming system [4]. This draws on two major themes.

The first is a body of work that aims to generalize variational inference. Kingma and Welling [6] and Rezende et al. [7] describe a reparameterization of the variational problem that simplifies optimization. Ranganath et al. [8] and Salimans and Knowles [9] propose a black-box technique, one that only requires the model and the gradient of the approximating family. Titsias and Lázaro-Gredilla [10] leverage the gradient of the joint density for a small class of models. Here we build on and extend these ideas to automate variational inference; we highlight technical connections as we develop the method.

The second theme is probabilistic programming. Wingate and Weber [11] study variational inference in general probabilistic programs, as supported by languages like Church [12], Venture [13], and Anglican [14]. Another probabilistic programming system is infer.NET, which implements variational message passing [15], an efficient algorithm for conditionally conjugate graphical models. Stan supports a more comprehensive class of nonconjugate models with differentiable latent variables; see Section 2.1.

2 Automatic Differentiation Variational Inference

Automatic differentiation variational inference (ADVI) follows a straightforward recipe. First we transform the support of the latent variables to the real coordinate space. For example, the logarithm transforms a positive variable, such as a standard deviation, to the real line. Then we posit a Gaussian variational distribution to approximate the posterior. This induces a non-Gaussian approximation in the original variable space. Last we combine automatic differentiation with stochastic optimization to maximize the variational objective. We begin by defining the class of models we support.

2.1 Differentiable Probability Models

Consider a dataset X = x_{1:N} with N observations. Each x_n is a discrete or continuous random vector. The likelihood p(X | θ) relates the observations to a set of latent random variables θ. Bayesian
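The transform step of the ADVI recipe can be checked numerically. A NumPy sketch (the variational parameter values are made up for illustration) shows that a Gaussian over the transformed parameter induces a non-Gaussian, correctly supported approximation over the original positive parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent sigma > 0; the transform zeta = log(sigma) has support on all of R.
mu_q, sd_q = 0.5, 0.3    # Gaussian variational parameters in zeta-space
zeta = rng.normal(mu_q, sd_q, size=100_000)

# Mapping back, sigma = exp(zeta) is lognormal: non-Gaussian and positive.
sigma = np.exp(zeta)
print(sigma.min() > 0)                                      # always positive
print(abs(sigma.mean() - np.exp(mu_q + sd_q**2 / 2)) < 0.02)  # lognormal mean
```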

GMM w/ Stoch.Subsamp

data {
  real<lower=0> N;              // number of data points in entire dataset
  int<lower=0> S_in_minibatch;
  int<lower=0> K;               // number of mixture components
  int<lower=0> D;               // dimension
  vector[D] y[S_in_minibatch];  // observations
  real<lower=0> alpha0;         // dirichlet prior
  real<lower=0> mu_sigma0;      // means prior
  real<lower=0> sigma_sigma0;   // variances prior
}
transformed data {
  real SVI_factor;
  vector<lower=0>[K] alpha0_vec;
  for (k in 1:K) {
    alpha0_vec[k] <- alpha0;
  }
  SVI_factor <- N / S_in_minibatch;
}
parameters {
  simplex[K] theta;             // mixing proportions
  vector[D] mu[K];              // locations of mixture components
  vector<lower=0>[D] sigma[K];  // standard deviations of mixture components
}
model {
  // priors
  theta ~ dirichlet(alpha0_vec);
  for (k in 1:K) {
    mu[k] ~ normal(0.0, mu_sigma0);
    sigma[k] ~ lognormal(0.0, sigma_sigma0);
  }
  // likelihood
  for (n in 1:S_in_minibatch) {
    real ps[K];
    for (k in 1:K) {
      ps[k] <- log(theta[k]) + normal_log(y[n], mu[k], sigma[k]);
    }
    increment_log_prob(log_sum_exp(ps));
    increment_log_prob(log(SVI_factor));
  }
}

Figure 11: ADVI Stan code for the GMM example, with stochastic subsampling of the dataset.

2 - Stochastic Subsampling?

data {

r e a l < lower=0> N; // number o f data p o i n t s in e n t i r e d a t a s e t

int < lower=0>

S_in_minibatch ;

data {

int < lower=0> K; // number o f mixture components

int < lower=0> N; // number o f data p o i n t s in e n t i r e d a t a s e t

int < lower=0> D; // dimension

int < lower=0> K; // number o f mixture components

int < lower=0> D; // dimension

v e c t o r [D] y [ S_in_minibatch ] ; // o b s e r v a t i o n s

v e c t o r [D] y [N ] ; // o b s e r v a t i o n s

r e a l < lower=0> alpha0 ;

// d i r i c h l e t p r i o r

r e a l < lower=0> alpha0 ;

// d i r i c h l e t p r i o r

r e a l < lower=0> mu_sigma0 ;

// means p r i o r

r e a l < lower=0> sigma_sigma0 ;

// v a r i a n c e s p r i o r

r e a l < lower=0> mu_sigma0 ;

// means p r i o r

}

r e a l < lower=0> sigma_sigma0 ;

// v a r i a n c e s p r i o r

}

transformed data {

r e a l SVI_factor ;

transformed data {

vector < lower =0>[K] alpha0_vec ;

vector < lower =0>[K] alpha0_vec ;

f o r ( k in 1 :K) {

f o r ( k in 1 :K) {

alpha0_vec [ k ] < - alpha0 ;

alpha0_vec [ k ] < - alpha0 ;

}

}

SVI_factor < - N / S_in_minibatch ;

}

}

parameters {

parameters {

simplex [K] theta ;

// mixing p r o p o r t i o n s

simplex [K] theta ;

// mixing p r o p o r t i o n s

v e c t o r [D] mu[K ] ;

// l o c a t i o n s o f mixture components vector [D] mu[K ] ;

// l o c a t i o n s o f mixture components

vector < lower =0>[D] sigma [K ] ;

// standard d e v i a t i o n s o f mixture comp

vect on

or en

< t

lo swer=0>[D] sigma [K] ; // standard deviations of mixture components

}

}

model {

model {

// p r i o r s

// p r i o r s

theta ~ d i r i c h l e t ( alpha0_vec ) ;

theta ~ d i r i c h l e t ( alpha0_vec ) ;

f o r ( k in 1 :K) {

f o r ( k in 1 :K) {

mu[ k ] ~ normal ( 0 . 0 , mu_sigma0 ) ;

mu[ k ] ~ normal ( 0 . 0 , mu_sigma0 ) ;

sigma [ k ] ~ lognormal ( 0 . 0 , sigma_sigma0 ) ;

sigma [ k ] ~ lognormal ( 0 . 0 , sigma_sigma0 ) ;

}

}

// l i k e l i h o o d

// l i k e l i h o o d

f o r ( n in 1 : S_in_minibatch ) {

f o r ( n in 1 :N) {

r e a l ps [K ] ;

r e a l ps [K ] ;

f o r ( k in 1 :K) {

f o r ( k in 1 :K) {

ps [ k ] < - l o g ( theta [ k ] ) + normal_log ( y [ n ] , mu[ k ] , sigma [ k ] ) ;

ps [ k ] < - l o g ( theta [ k ] ) + normal_log ( y [ n ] , mu[ k ] , sigma [ k ] ) ;

}increment_log_prob(log_sum_exp(ps)) ;

}

}

increment_log_prob ( log_sum_exp ( ps ) ) ;

increment_log_prob ( l o g ( SVI_factor ) ) ;

}

}

}

30

Figure 11: advi Stan code for the gmm example, with stochastic subsampling of the

Figure 10: advi Stan code for the gmm example.

dataset.

20

ADVI: 8 schools (BDA)

31 - 8 schools: result

32 - rats: stan model

33 - rats: R

34 - rats: result (Very Different…)

35 - ADVI

• Highly sensitive to initial values

• Highly sensitive to some parameters

• So, you need to run multiple inits for now

https://groups.google.com/forum/#!msg/stan-users/FaBvi8w7pc4/qnIFPEWSAQAJ
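The multiple-inits workaround can be sketched generically (a toy stand-in, not an actual ADVI run; `toy_elbo` and `optimize` are hypothetical names for a nonconvex objective and a gradient-ascent optimizer): start from several random initial values and keep the run with the best final objective.

```python
import numpy as np

def toy_elbo(x):
    """Hypothetical nonconvex objective standing in for a model's ELBO."""
    return -(x**2) * (x - 2)**2 - 0.1 * x

def optimize(x0, steps=500, lr=0.01):
    """Plain gradient ascent from a given initial value."""
    x = x0
    for _ in range(steps):
        g = (toy_elbo(x + 1e-5) - toy_elbo(x - 1e-5)) / 2e-5  # numeric grad
        x += lr * g
    return x

rng = np.random.default_rng(0)
# Sensitive to initialization: run several inits, keep the best objective.
inits = rng.uniform(-1.0, 3.0, size=5)
results = []
for x0 in inits:
    x_hat = optimize(x0)
    results.append((toy_elbo(x_hat), x_hat))
best_elbo, best_x = max(results)
print(best_elbo > -0.5)   # the best run found one of the good local optima
```

Different initial values land in different local optima; comparing final objective values across runs is the pragmatic selection rule the bullet points describe.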

36 - Resources

• ADVI - 10 minute presentation
https://www.youtube.com/watch?v=95bpsWr1lJ8

• Variational Inference: A Review for Statisticians
http://www.proditus.com/papers/BleiKucukelbirMcAuliffe2016.pdf

• Stan Modeling Language Users Guide and Reference Manual
https://github.com/stan-dev/stan/releases/download/v2.9.0/stan-reference-2.9.0.pdf
