This page reproduces the content of http://www.slideshare.net/marcoBan/hmc-and-nuts.



Presentation of the NUTS Algorithm by M. Hoffman and A. Gelman

(disclaimer: informal work; the huge amount of interesting work by R. Neal is not entirely referenced)

Outline
• Hamiltonian MCMC
• NUTS
• ε tuning
• Numerical Results
• Conclusion

No-U-Turn Sampler
based on M. D. Hoffman and A. Gelman (2011)

Marco Banterle

Presented at the "Bayes in Paris" reading group
11/04/2013


Hamiltonian dynamics

Hamiltonian MC techniques have a nice foundation in physics: they describe the total energy of a system composed of a frictionless particle sliding on a (hyper-)surface.

Position q ∈ R^d and momentum p ∈ R^d are necessary to define the total energy H(q, p), which is usually formalized as the sum of a potential energy term U(q) and a kinetic energy term K(p).

Hamiltonian dynamics are characterized by

∂q/∂t = ∂H/∂p,    ∂p/∂t = −∂H/∂q

which describe how the system changes through time.


Properties of the Hamiltonian

• Reversibility: the mapping from (q, p) at time t to (q, p) at time t + s is one-to-one
• Conservation of the Hamiltonian: ∂H/∂t = 0, i.e. the total energy is conserved
• Volume preservation: applying the map resulting from time-shifting to a region R of the (q, p)-space does not change the volume of the (projected) region
• Symplecticness: a stronger condition than volume preservation.


What about MCMC?

H(q, p) = U(q) + K(p)

We can easily interpret U(q) as minus the log target density for the variable of interest q, while p will be introduced artificially.


What about MCMC?

H(q, p) = U(q) + K(p) → −log π(q|y) − log f(p)

Let’s examine its properties under a statistical lens:

• Reversibility: MCMC updates that use these dynamics leave the desired distribution invariant
• Conservation of the Hamiltonian: Metropolis updates using exact Hamiltonian dynamics are always accepted
• Volume preservation: we don't need to account for any change in volume in the acceptance probability for Metropolis updates (no need to compute the determinant of the Jacobian matrix of the mapping)


the Leapfrog

Hamilton's equations are not always¹ explicitly available, hence the need for time discretization with some small step size ε.
A numerical integrator which serves our purposes is called the leapfrog.

Leapfrog Integrator
For j = 1, . . . , L:
1. p_{t+ε/2} = p_t − (ε/2) ∂U/∂q (q_t)
2. q_{t+ε} = q_t + ε ∂K/∂p (p_{t+ε/2})
3. p_{t+ε} = p_{t+ε/2} − (ε/2) ∂U/∂q (q_{t+ε})
t = t + ε

¹ unless they are of quadratic form
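To make the discrete update concrete, here is a minimal Python sketch of the leapfrog integrator (the function names grad_U and grad_K and the example target are mine, not from the slides; grad_K defaults to the identity, i.e. M = I):

import numpy as np

def leapfrog(q, p, eps, L, grad_U, grad_K=lambda p: p):
    """Run L leapfrog steps of size eps starting from (q, p)."""
    q, p = np.array(q, dtype=float), np.array(p, dtype=float)
    for _ in range(L):
        p -= 0.5 * eps * grad_U(q)   # half step for the momentum
        q += eps * grad_K(p)         # full step for the position
        p -= 0.5 * eps * grad_U(q)   # second half step for the momentum
    return q, p

# Example: standard Gaussian target, U(q) = q·q/2, so grad_U(q) = q
q1, p1 = leapfrog(q=[1.0], p=[0.5], eps=0.1, L=10, grad_U=lambda q: q)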


A few more words

The kinetic energy is usually (for simplicity) assumed to be K(p) = pᵀ M⁻¹ p / 2, which corresponds to p|q ≡ p ∼ N(0, M).
This finally implies that in the leapfrog ∂K/∂p (p_{t+ε/2}) = M⁻¹ p_{t+ε/2}.

Usually M, for which little guidance exists, is taken to be diagonal and often equal to the identity matrix.

The leapfrog method preserves volume exactly and, due to its symmetry, it is also reversible by simply negating p, applying the same number L of steps again, and then negating p again².

It does not however conserve the total energy, and thus the deterministic move to (q′, p′) will be accepted with probability

min{ 1, exp(−H(q′, p′) + H(q, p)) }

² negating ε serves the same purpose


HMC algorithm

We now have all the elements to construct an MCMC method based on Hamiltonian dynamics:

HMC
Given q_0, ε, L, M
For i = 1, . . . , N:
1. p ∼ N(0, M)
2. Set q_i ← q_{i−1},  q′ ← q_{i−1},  p′ ← p
3. for j = 1, . . . , L: update (q′, p′) through the leapfrog
4. Set q_i ← q′ with probability min{ 1, exp(−H(q′, p′) + H(q_{i−1}, p)) }
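A minimal, self-contained sketch of this loop in Python, assuming an identity mass matrix M = I and a user-supplied potential U with gradient grad_U (an illustration, not the exact implementation from the paper):

import numpy as np

def hmc(q0, U, grad_U, eps, L, n_iter, rng=np.random.default_rng()):
    """Basic HMC sampler with identity mass matrix (a sketch)."""
    samples = [np.asarray(q0, dtype=float)]
    for _ in range(n_iter):
        q = samples[-1].copy()
        p = rng.standard_normal(q.shape)          # 1. p ~ N(0, I)
        q_new, p_new = q.copy(), p.copy()
        for _ in range(L):                        # 3. L leapfrog steps
            p_new -= 0.5 * eps * grad_U(q_new)
            q_new += eps * p_new
            p_new -= 0.5 * eps * grad_U(q_new)
        H_old = U(q) + 0.5 * p @ p
        H_new = U(q_new) + 0.5 * p_new @ p_new
        if rng.random() < np.exp(H_old - H_new):  # 4. Metropolis accept/reject
            samples.append(q_new)
        else:
            samples.append(q)
    return np.array(samples)

# Example: 2-d standard Gaussian target
chain = hmc(np.zeros(2), U=lambda q: 0.5 * q @ q, grad_U=lambda q: q,
            eps=0.2, L=20, n_iter=1000)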


HMC benefits

HMC makes use of gradient information to move across the space, so its typical trajectories do not resemble random walks.
Moreover the error in the Hamiltonian stays bounded³ and hence we also have a high acceptance rate.

³ even if theoretical justifications for this are missing


HMC benefits

Random-walk proposals:
• require changes proposed with magnitude comparable to the sd in the most constrained direction (square root of the smallest eigenvalue of the covariance matrix)
• need, to reach an almost-independent state, a number of iterations mostly determined by how long it takes to explore the least constrained direction
• have no tendency to move consistently in the same direction.

For HMC the proposal moves according to the gradient for L steps, even if ε is still constrained by the smallest eigenvalue.


Connected difficulties

Performance depends strongly on choosing suitable values for ε and L.
• ε too large → inaccurate simulation & high rejection rate
• ε too small → wasted computational power (small steps)
• L too small → random-walk behavior and slow mixing
• L too large → trajectories retrace their steps → U-TURN
  (even worse: PERIODICITY!)


Periodicity

Ergodicity of HMC may fail if a produced trajectory of length Lε is an exact period for some state.

Example
Consider q ∼ N(0, 1), so that H(q, p) = q²/2 + p²/2 and the resulting Hamiltonian dynamics are

dq/dt = p,    dp/dt = −q

which have the exact solution

q(Lε) = r cos(a + Lε),    p(Lε) = −r sin(a + Lε)

for some real a and r. If we choose a trajectory length such that Lε = 2π, at the end of the iteration we return to the starting point!

• Lε near the period makes HMC ergodic but practically useless
• interactions between variables prevent exact periodicities, but near-periodicities might still slow HMC considerably!


Other useful tools - Windowed HMC

The leapfrog introduces "random" errors in H at each step, and hence the acceptance probability may have a high variability.
Smoothing out these oscillations (over windows of states) could lead to higher acceptance rates.


Windows map
We map (q, p) → [(q_0, p_0), . . . , (q_{W−1}, p_{W−1})] by writing (q, p) = (q_s, p_s), s ∼ U(0, W − 1), and deterministically recover the other states. This window has probability density

P([(q_0, p_0), . . . , (q_{W−1}, p_{W−1})]) = (1/W) Σ_{i=0}^{W−1} P(q_i, p_i)


Windowed HMC

Similarly we perform L − W + 1 leapfrog steps starting from (q_{W−1}, p_{W−1}), up to (q_L, p_L), and then accept the window [(q_{L−W+1}, −p_{L−W+1}), . . . , (q_L, −p_L)] with probability

min{ 1, Σ_{i=L−W+1}^{L} P(q_i, p_i) / Σ_{i=0}^{W−1} P(q_i, p_i) }

and finally select (q_a, −p_a) with probability

P(q_a, p_a) / Σ_{i ∈ accepted window} P(q_i, p_i)
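A small sketch of this accept/select step, assuming the unnormalized densities P(q, p) = exp(−H(q, p)) have already been computed for the states in the accept window and in the reject window (the function and argument names are mine; when the accept window loses the coin flip, Neal's scheme selects a state from the reject window in the same way):

import numpy as np

def windowed_select(P_accept, P_reject, rng=np.random.default_rng()):
    """Accept the whole accept window with probability
    min(1, sum(P_accept) / sum(P_reject)), then pick one state inside
    the chosen window with probability proportional to P(q, p)."""
    P_accept = np.asarray(P_accept, dtype=float)
    P_reject = np.asarray(P_reject, dtype=float)
    accepted = rng.random() < min(1.0, P_accept.sum() / P_reject.sum())
    window = P_accept if accepted else P_reject
    idx = rng.choice(len(window), p=window / window.sum())
    return accepted, idx   # which window was chosen and the state index within it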


Slice Sampler

Slice Sampling idea

Sampling from f(x) is equivalent to uniform sampling from

SG(f) = {(x, y) | 0 ≤ y ≤ f(x)}

the subgraph of f.

We simply make use of this by sampling from f with auxiliary-variable Gibbs sampling:
• y|x ∼ U(0, f(x))
• x|y ∼ U_{S_y}, where S_y = {x | y ≤ f(x)} is the slice
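A minimal Python sketch of this auxiliary-variable scheme for a univariate unnormalised density f (using the simple "random bracket plus shrinkage" variant, which is valid but is not the stepping-out/doubling procedure of Neal (2003); the bracket width w is an assumed tuning parameter):

import numpy as np

def slice_sample_step(x, f, w=1.0, rng=np.random.default_rng()):
    """One univariate slice-sampling update: draw y | x ~ U(0, f(x)),
    place an interval of width w at random around x, then shrink it
    until a point inside the slice {x' : y <= f(x')} is found."""
    y = rng.uniform(0.0, f(x))            # y | x ~ U(0, f(x))
    lo = x - w * rng.random()             # random placement of the bracket
    hi = lo + w
    while True:
        x_new = rng.uniform(lo, hi)
        if y <= f(x_new):                 # x_new lies inside the slice S_y
            return x_new
        if x_new < x:                     # otherwise shrink the bracket towards x
            lo = x_new
        else:
            hi = x_new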


What we (may) have and want to avoid

[Figure: example HMC trajectories; opacity and marker size grow with the number of leapfrog steps taken; start and end points in black.]


Main idea

Define a criterion that helps us avoid these U-turns, stopping when we have simulated "long enough": the instantaneous distance gain

C(q, q′) = ∂/∂t [ ½ (q′ − q)ᵀ(q′ − q) ] = (q′ − q)ᵀ ∂/∂t (q′ − q) = (q′ − q)ᵀ p′

Simulating until C(q, q′) < 0 leads to a non-reversible Markov chain, so the authors devised a different scheme.


Main Idea

• NUTS augments the model with a slice variable u
• Add a finite set C of candidates for the update
• C ⊆ B deterministically chosen, with B the set of all leapfrog steps

(At a high level) B is built by doubling and by checking C(q, q′) on sub-trees.


High level NUTS steps

1. resample momentum p ∼ N(0, I)
2. sample u | q, p ∼ U[0, exp(−H(q_t, p))]
3. generate the proposal from p(B, C | q_t, p, u, ε)
4. sample (q_{t+1}, p) ∼ T(· | q_t, p, C)

T(· | q_t, p, C) is chosen so that it leaves the uniform distribution over C invariant
(C contains (q′, p′) s.t. u ≤ exp(−H(q′, p′)) and satisfies reversibility)
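To fix ideas, here is a heavily simplified Python skeleton of one such iteration: identity mass matrix, and a fixed-length trajectory in place of the doubling construction of B and C described next, so the names and the simplifications are mine rather than the paper's algorithm.

import numpy as np

def nuts_like_step(q, U, grad_U, eps, n_steps, rng=np.random.default_rng()):
    """Skeleton of steps 1-4 above, with tree building replaced by a
    fixed-length leapfrog trajectory for brevity."""
    p = rng.standard_normal(q.shape)                       # 1. p ~ N(0, I)
    log_u = np.log(rng.uniform()) - (U(q) + 0.5 * p @ p)   # 2. slice variable (log scale)
    candidates = [q.copy()]                                # C always contains the start
    q_new, p_new = q.copy(), p.copy()
    for _ in range(n_steps):                               # 3. build candidate states
        p_new = p_new - 0.5 * eps * grad_U(q_new)
        q_new = q_new + eps * p_new
        p_new = p_new - 0.5 * eps * grad_U(q_new)
        if log_u <= -(U(q_new) + 0.5 * p_new @ p_new):     # keep slice-admissible states only
            candidates.append(q_new.copy())
    return candidates[rng.integers(len(candidates))]       # 4. uniform draw from C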


Justify T(q′, p′ | q_t, p, C)

Conditions on p(B, C | q, p, u, ε):
C.1: Elements in C are chosen in a volume-preserving way
C.2: p((q, p) ∈ C | q, p, u, ε) = 1
C.3: p(u ≤ exp(−H(q′, p′)) | (q′, p′) ∈ C) = 1
C.4: If (q, p) ∈ C and (q′, p′) ∈ C then p(B, C | q, p, u, ε) = p(B, C | q′, p′, u, ε)

By these,

p(q, p | u, B, C, ε) ∝ p(B, C | q, p, u, ε) p(q, p | u)
                     ∝ p(B, C | q, p, u, ε) · I{u ≤ exp(−H(q, p))}    (C.1)
                     ∝ I{(q, p) ∈ C}                                   (C.2, C.3, C.4)



p(B, C | q, p, u, ε) - Building B by doubling

Build B by repeatedly doubling a binary tree whose leaves are (q, p) states.
• Choose a random "direction in time" ν_j ∼ U({−1, +1})
• Take 2^j leapfrog steps of size ν_j ε from (q⁻, p⁻) if ν_j = −1, or from (q⁺, p⁺) if ν_j = +1
• Continue until a stopping rule is met

Given the start (q, p) and ε there are 2^j equi-probable height-j trees; reconstructing a particular height-j tree from any of its leaves has probability 2^{−j}.

Possible stopping rule: stop at height j when
• for one of the 2^j − 1 subtrees
  (q⁺ − q⁻)ᵀ p⁻ < 0   or   (q⁺ − q⁻)ᵀ p⁺ < 0
• or the tree includes a leaf s.t. log u + H(q, p) > ∆_max
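A Python sketch of the doubling loop with the U-turn check applied only to the two outermost end-points, ignoring the slice and candidate-set bookkeeping and the per-subtree checks; it illustrates the stopping rule rather than the full recursive algorithm, and all names are mine.

import numpy as np

def double_until_uturn(q0, p0, eps, grad_U, max_doublings=10,
                       rng=np.random.default_rng()):
    """Grow the trajectory by doubling in a random time direction and
    stop at the first U-turn between the end-points (q-, p-), (q+, p+)."""
    q_minus, p_minus = np.array(q0, dtype=float), np.array(p0, dtype=float)
    q_plus, p_plus = np.array(q0, dtype=float), np.array(p0, dtype=float)
    for j in range(max_doublings):
        nu = rng.choice([-1, 1])          # random direction in time, nu_j
        for _ in range(2 ** j):           # take 2^j leapfrog steps of size nu_j * eps
            if nu == -1:
                p_minus -= 0.5 * nu * eps * grad_U(q_minus)
                q_minus += nu * eps * p_minus
                p_minus -= 0.5 * nu * eps * grad_U(q_minus)
            else:
                p_plus -= 0.5 * nu * eps * grad_U(q_plus)
                q_plus += nu * eps * p_plus
                p_plus -= 0.5 * nu * eps * grad_U(q_plus)
        dq = q_plus - q_minus             # stop when (q+ - q-).p- < 0 or (q+ - q-).p+ < 0
        if dq @ p_minus < 0 or dq @ p_plus < 0:
            break
    return (q_minus, p_minus), (q_plus, p_plus), j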


Satisfying Detailed-Balance

We've defined p(B | q, p, u, ε) → deterministically select C.

Remember the conditions:

C.1: Elements in C are chosen in a volume-preserving way
→ satisfied because of the leapfrog

C.2: p((q, p) ∈ C | q, p, u, ε) = 1
→ satisfied if C includes the initial state

C.3: p(u ≤ exp(−H(q′, p′)) | (q′, p′) ∈ C) = 1
→ satisfied if we exclude points outside the slice u

C.4: If (q, p) ∈ C and (q′, p′) ∈ C then p(B, C | q, p, u, ε) = p(B, C | q′, p′, u, ε)
→ as long as C given B is deterministic, p(B, C | q, p, u, ε) is either 2^{−j} or 0; satisfied if we exclude states that couldn't generate B

Finally we select (q′, p′) at random from C.


Efficient NUTS

The computational cost per leapfrog step is comparable with HMC (just 2^{j+1} − 2 more inner products).

However:
• it requires storing 2^j position-momentum states
• long jumps are not guaranteed
• time is wasted if a stopping criterion is met during a doubling


Efficient NUTS

The authors address these issues as follows:

• Time wasted if a stopping criterion is met during doubling:
as soon as a stopping rule is met, break out of the loop.

• Long jumps are not guaranteed: consider the kernel

T(w′ | w, C) =
  I[w′ ∈ C_new] / |C_new|                                                              if |C_new| > |C_old|
  (|C_new| / |C_old|) · I[w′ ∈ C_new] / |C_new| + (1 − |C_new| / |C_old|) · I[w′ = w]   if |C_new| ≤ |C_old|

where w is short for (q, p), and C_new and C_old are respectively the elements of C introduced during the final doubling and the older elements already in C.

• Storing 2^j position-momentum states:
iteratively applying the above kernel after every doubling moreover requires storing only O(j) positions (and set sizes) rather than O(2^j).
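A sketch of this per-doubling transition in Python (names are mine): after each doubling, jump to a uniformly chosen element of the newly added half C_new with probability min(1, |C_new|/|C_old|), otherwise keep the current proposal; carrying only the running proposal and the set sizes is what brings the memory cost down to O(j).

import random

def progressive_update(current, c_new, n_old, rng=random):
    """Per-doubling kernel T(w'|w, C): move uniformly into C_new with
    probability min(1, |C_new| / |C_old|), otherwise stay at w."""
    n_new = len(c_new)
    if n_new > 0 and rng.random() < min(1.0, n_new / n_old):
        return rng.choice(c_new)
    return current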


Adaptive MCMC

Classic adaptive MCMC idea: stochastic optimization on the parameter with vanishing adaptation:

θ_{t+1} ← θ_t − η_t H_t

where η_t → 0 at the "correct" pace and H_t = δ − α_t defines a characteristic of interest of the chain.

Example
Consider the case of a random-walk MH with proposal x′ | x_t ∼ N(x_t, θ_t), adapted on the acceptance rate α_t in order to reach the desired optimum δ = 0.234.


Double Averaging

This idea has however an intrinsic flaw (in particular here):
• the diminishing step sizes η_t give more weight to the early iterations

The authors rely then on the following scheme³:

ε_{t+1} = µ − (√t / (γ (t + t₀))) Σ_{i=1}^{t} H_i          (adaptation)
ε̃_{t+1} = η_t ε_{t+1} + (1 − η_t) ε̃_t                      (averaged value, used for sampling after adaptation)

where γ controls the shrinkage towards µ, t₀ stabilizes the early iterations, and η_t = t^{−k} with k ∈ (0.5, 1].

³ introduced by Nesterov (2009) for stochastic convex optimization
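A Python sketch of the recursion above, applied directly to ε as it is written on the slide (the paper actually runs the recursion on log ε, and the hyper-parameter values here are illustrative assumptions):

import math

def dual_averaging(alphas, mu, gamma=0.05, t0=10.0, k=0.75, delta=0.65):
    """Dual-averaging step-size adaptation: alphas are the acceptance
    statistics alpha_1..alpha_T observed so far; returns the last adapted
    value eps_{T+1} and the running average eps_bar_{T+1} used after warm-up."""
    eps, eps_bar, h_sum = mu, mu, 0.0
    for t, alpha in enumerate(alphas, start=1):
        h_sum += delta - alpha                                # running sum of H_i = delta - alpha_i
        eps = mu - math.sqrt(t) / (gamma * (t + t0)) * h_sum  # eps_{t+1}
        eta = t ** (-k)                                       # eta_t = t^{-k}, k in (0.5, 1]
        eps_bar = eta * eps + (1.0 - eta) * eps_bar           # eps_bar_{t+1}
    return eps, eps_bar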


Select the criterion

The authors advise to use H_t = δ − α_t with δ = 0.65 and

α_t^{HMC} = min{ 1, P(q′, −p′) / P(q_{t−1}, −p_t) }

α_t^{NUTS} = (1 / |B_t^{final}|) Σ_{(q, p) ∈ B_t^{final}} min{ 1, P(q, −p) / P(q_{t−1}, −p_t) }

where p_t is the momentum resampled at iteration t and B_t^{final} is the set of states visited during the final doubling.


Alternative: random ε

Periodicity is still a problem, and the "optimal" ε may differ in different regions of the target (e.g. mixtures)!
The solution might be to randomize ε around some ε₀.

Example
ε ∼ E(ε₀)    or    ε ∼ U(c₁ε₀, c₂ε₀),  c₁ < 1, c₂ > 1


Helpful especially for HMC; care is needed for NUTS since L is usually inversely proportional to ε.
What was said about adaptation can be combined with this by using ε₀ = ε̃_t.


Numerical Examples

"It is at least as good as an 'optimally' tuned HMC."

Tested on:
1. (MVN): 250-d Normal whose precision matrix was sampled from a Wishart with identity scale matrix and 250 df
2. (LR): (24+1)-dim regression coefficient distribution on the German credit data with weak priors
3. (HLR): same dataset, with exponentially distributed variance hyperparameters and two-way interactions included → (300+1)*2-d predictors
4. (SV): 3000 days of returns from the S&P 500 index → 3001-d target following this scheme:

τ ∼ E(100);   ν ∼ E(100);   s_1 ∼ E(100);
log s_i ∼ N(log s_{i−1}, τ⁻¹);
(log y_i − log y_{i−1}) / s_i ∼ t_ν.


ε and L in NUTS

Also the dual averaging usually does a good job of coercing the

statistic H to its desired value.

Most of the trajectory lengths are integer powers of two, indicating

that the U-turn criterion is usually satisfied only after a doubling is

completed, which is desirable since it means that we only

occasionally have to throw out entire half-trajectories to satisfy DB.


Multivariate Normal

Highly correlated, 30-dim multivariate Normal.

To get the full advantage of HMC, the trajectory has to be long enough (but not much longer) that in the least constrained direction the end-point is distant from the start point.

If the trajectory instead reverses direction before the least constrained direction has been explored, this can produce a U-turn when the trajectory is much shorter than is optimal.

Fortunately, too-short trajectories are cheap relative to long-enough trajectories!
(but this does not help in avoiding random-walk behavior)


Multivariate Normal

One way to overcome the problem could be to check for U-turns only in the least constrained direction.
But usually such information is not known a priori. Maybe after a trial run to discover the order of magnitude of the variables?
(someone said periodicity?)

Non-Uniform weighting system

Qin and Liu (2001) propose a weighting procedure similar to Neal's:
• Generate the whole trajectory from (q_0, p_0) = (q, p) to (q_L, p_L)
• Select a point in the "accept window" [(q_{L−W+1}, p_{L−W+1}), . . . , (q_L, p_L)], say (q_{L−k}, p_{L−k}); this is equivalent to selecting k ∼ U(0, W)
• Take k steps backward from (q_0, p_0) to (q_{−k}, p_{−k})
• Reject window: [(q_{−k}, p_{−k}), . . . , (q_{W−k−1}, p_{W−k−1})]
• Accept (q_{L−k}, p_{L−k}) with probability

  min{ 1, Σ_{i=1}^{W} w_i P(q_{L−W+i}, −p_{L−W+i}) / Σ_{i=1}^{W} w_i P(q_{i−k−1}, −p_{i−k−1}) }

• With uniform weights w_i = 1/W ∀i
• Other weights may favor states further from the start!


Riemann Manifold HMC

• In statistical modeling the parameter space is a manifold
• d(p(y|θ), p(y|θ + δθ)) can be defined as δθᵀ G(θ) δθ, where G(θ) is the expected Fisher information matrix [Rao (1945)]
• G(θ) defines a position-specific Riemann metric.

Girolami & Calderhead (2011) used this to tune K(p) using local 2nd-order derivative information encoded in M = G(q).


Riemann Manifold HMC

The deterministic proposal is guided not only by the gradient of the target density but also exploits local geometric structure.
Possible optimality: paths produced by the solution of Hamilton's equations follow geodesics on the manifold.

Practically, p|q ∼ N(0, G(q)), which resolves the scaling issues of HMC:
• tuning of ε is less critical
• non-separable Hamiltonian:
  H(q, p) = U(q) + ½ log((2π)^D |G(q)|) + ½ pᵀ G(q)⁻¹ p
• need for an (expensive) implicit integrator (generalized leapfrog + fixed-point iterations)


RMHMC & Related

"In some cases, the computational overhead for solving implicit equations undermines RMHMC's benefits." [Lan et al. (2012)]

Lagrangian Dynamics (RMLMC)
Replaces the momentum p (mass × velocity) in HMC by the velocity v, with a volume correction through the Jacobian
• semi-implicit integrator for the "Hamiltonian" dynamics
• a different, fully explicit integrator for the Lagrangian dynamics, at the cost of two extra matrix inversions to update v

Adaptively Updated (AUHMC)
Replace G(q) with M(q, q′) = ½ [G(q) + G(q′)]
• fully explicit leapfrog integrator (M(q, q′) is constant through the trajectory)
• less locally adaptive (especially for long trajectories)
• is it really reversible?


Split Hamiltonian

Variations on HMC obtained by using discretizations of Hamiltonian dynamics that split H into:

H(q, p) = H_1(q, p) + H_2(q, p) + · · · + H_K(q, p)

This may allow much of the movement to be done at low cost.
• U(q) written as (minus) the log of a Gaussian plus a second term: quadratic forms allow for an explicit solution
• H_1 and its gradient can be evaluated quickly, with only a slowly-varying H_2 requiring costly computations (e.g. splitting the data)


split-wRMLNUTS & co.

• NUTS itself provides a nice automatic tuning of HMC
• very few citations of their work so far!
• integration with RMHMC (just out!) may overcome ε-related problems and provide better criteria for the stopping rule!

HMC may not be the definitive sampler, but it is definitely as useful as it is unknown.
It seems the direction taken may finally overcome the difficulties connected with its use and spread.


Betancourt (2013b)

• RMHMC has smoother movements on the surface, but..
• ..(q′ − q)ᵀ p no longer has a meaning

The paper addresses this issue by generalizing the stopping rule to other mass matrices M.


References I

Introduction:

• M. D. Hoffman & A. Gelman (2011) - The No-U-Turn Sampler
• R. M. Neal (2011) - MCMC using Hamiltonian dynamics, in Handbook of Markov Chain Monte Carlo
• Z. S. Qin and J. S. Liu (2001) - Multipoint Metropolis method with application to hybrid Monte Carlo
• R. M. Neal (2003) - Slice Sampling


References II

Further Readings:

• M. Girolami, B. Calderhead (2011) - Riemann manifold

Langevin and Hamiltonian Monte Carlo methods

• Z. Wang, S. Mohamed, N. de Freitas (2013) - Adaptive

Hamiltonian and Riemann Manifold Monte Carlo Samplers

• S. Lan, V. Stathopoulos, B. Shahbaba, M. Girolami (2012) -

Lagrangian Dynamical Monte Carlo

• M. Burda, J. M. Maheu (2013) - Bayesian Adaptively Updated

Hamiltonian Monte Carlo with an Application to

High-Dimensional BEKK GARCH Models

• M. Betancourt (2013) - Generalizing the No-U-Turn

Sampler to Riemannian Manifolds