Sunday, September 14, 2008

Why MCMC works?

Long ago, I gave a chalk-talk for the theoretical details of why MCMC works? I apologize for the lost references; these notes are mostly extracted out of online tutorials, classroom notes, and some books. These notes are more like cheat-sheet, and may not be very readable. Nevertheless, the step-by-step progression could provide useful pointers.

Importance Sampling


E_p[f] = \Int_X p(x) f(x) dx = \Int_X (p/q * f) * q dx
= \Int_X g(x) * q(x) dx
= E_q [p(x)/q(x) * f(x) ]
- unbiased, but has high variance.
To make it low variance, use the normalization, commonly called weighted importance sampling. The actual formulation for weighted sampling is motivated by the fact that we can not compute p(x)/q(x), rather we can only compute p*(x)/q*(x), where f* represents un-normalized distribution, and f~ denotes sample distribution for f. Note that the new estimator is biased.
We refer to p(x) as target and q(x) as proposal distributions. q(x) is a distribution which is easy to sample from, e.g., uniform, Gaussian, Cauchy, exponential).

Rejection Sampling


- generate y ~ q(.)
- sample u ~ Unif[0, c*q*(y)]
- if u > p~(y), reject y
~~~~ otherwise, accept y
p_sample(y) = p(y)
p_accept = Area(accept) / Area(accept) + Area(reject) \prop 1/c

Metropolis-Hastings algorithm


Given a target distribution p(x) = p*(x) / Z,
- Initialize X(0) arbitrarily
- For iteration t = 0, 1, 2
~~~~ > draw y from proposal distribution, q(.| X(t)), call it the transition function T(X(t), .))
~~~~ > accept/reject: sample u ~ Unif[0,1]
~~~~ > X(t+1) = ~~ y, if u \lt A(X(t), Y),
~~~~ ~~~~~~~~~~~~~ X(t), otherwise,
~~~~ ~~~~~~~ where A(X(t), Y) = min (1, p*(y)T(y, x(t)) / p*(x(t))T(x(t), y) )

Special case


T(x,y) = T(y,x) i.e., T is symmetric -- Metropolis algorithm, in which case,
A(x(t), y) = min(1, p*(y)/p*(x(t)) ). Thus, if p~(y) \geq p~(x(t)), always accept, otherwise accept with probability p~(y)/p~(x(t)) ( < 1)
Analogy: Always go uphill, and go downhill with some probability, so we are not stuck in a local maximum.
In a way, this algorithm is performing a random walk in the entire space.

Background on Markov Chains


- X(1) ... X(t)
- X(0) ~ q_0(.)
- transition X(t) --> X(t+1), specified by S(X(t), X(t+1)) = q(X(t+1) | X(t)). note that \Sum_y S(x,y) = 1.
- assume homogeneous chain, i.e., S is the same for all t.

Working


- initial distribution, q_0
- compute q_(t+1)(y) = \Sum_x q_t(x) S(x, y) {matrix notation: q_(t+1)' = q_t' * S}
Note that S is row-stochastic, i.e. S*1 = 1

Why it works?


We want to know if q_t converges to some limit, and if does, what does it converge to?
Definition: We say that \pi \in R^k is invariant (w.r.t. S) if \pi \geq 0, 1'\pi = \pi, and \pi' * S = \pi. Alternatively, \pi is a fixed point for the update equation.
Idea: Let X(0) ~ \pi, then q_t = q_0 * S^t = \pi* S^t = \pi. In other words, if we start in \pi, we stay in \pi. Hence for convergence, we want to reach such a point where we have q_t as invariant (w.r.t. S).
Note that the matrix S \in R^(k x k) is (a) non-negative, i.e., S(i,j) \geq 0, and (b) row-stochastic, i.e., S*1 = 1.

Perron-Frobenius theorem


Definition: spectral radius of a matrix S is defined as the highest absolute eigenvalue of the matrix.
Every non-negative, row-stochastic matrix S satisfies \rho(S) = 1. Let the corresponding left eigen-vector be v. v \gt 0, and v'S = v. So we can set \pi (in the above discussion) to the vector \pi_i = v_i / (\Sum_j v_j).
Corollary: Every finite-state Markov chain has an invariant distribution.
So we conclude that q_t converges. Next question: is it unique? In other words, we want to find out conditions when q_t converges to a unique \pi (now that we know that such a limit exists).
A related question is: when is an invariant distribution unique?
Ans: when the Markov chain is ergodic.
If the Markov chain is ergodic and aperiodic, then there is a unique invariant.

Back to Metropolis-Hastings


First, check if \Sum_x p(x) S(x,y) = p(y).
Remember, S(x,y) = T(x,y) min (1, p*(y) T(y,x) / p*(x) T(x,y) )
Definition: A matrix S satisfies detailed-balance w.r.t. \pi iff
\forall x,y, \pi(x)S(x,y) = \pi(y)S(y,x).
(also called reversibility, i.e., moving in both direction in the chain are equivalent)
Lemma: Any such \pi is invariant w.r.t. S.
So it suffices to check that S(x,y) is in detailed-balance w.r.t. the target distribution, which is true. Thus p(.) is invariant w.r.t. S.

Gibbs Sampling


To make it more practical and easy to handle, we need careful ways to design S.
Gibbs Sampling is one such special type of Metropolis-Hastings algorithm.
- transitions easy to draw from.
- Consider random vector (X1 ... Xn), where n is the number of nodes in the graphical model.
~~ initialize X(0) = X1(0) ... Xn(0) arbitrarily.
~~ For iteration t = 0,1,2...
~~~~~~~ update Xi(t+1) ~ p(.|X1(t) ...Xi-1(t) Xi+1(t) ... Xn(t))
i.e., walk through the graph and update using local conditional probabilities.

Thursday, September 04, 2008

Product of Experts and Contrastive Divergence

So we are LIVING again.

Instead of posting my notes here, I am posting a pointer/reminder to check the entries for "Boltzmann machines" and "product of experts" on scholarpedia.
 
Learning in Vision: September 2008