Markov Chain Monte Carlo (MCMC) explained

Understanding the magic behind estimating complex quantities using randomness


Published in Towards Data Science · 10 min read · Jul 27, 2021

MCMC has been one of the most important and popular concepts in Bayesian statistics, especially for doing inference.

To put it in the bigger picture: exact inference in high dimensions can become computationally infeasible. In such cases we resort to approximation, either by sampling (MCMC is one such approach) or by fitting a parameterized distribution (Variational Inference).

In this post, we will discuss MCMC by taking apart and understanding its components with examples. Later in the post, we will see one of the algorithms built on this concept, the Metropolis-Hastings algorithm.

Without further ado, let us get started!

Note: The subsections marked 'Extras' can be skipped without affecting the general understanding of the concept.

Sampling

Sampling is a way to approximately estimate certain characteristics of the whole population by taking a subset of the population into the study.

Sampling has various use cases —

  • It can be used to approximate an intractable sum or integral.
  • It can provide a significant speedup when estimating a tractable but costly sum or integral.
  • In some cases, like density estimation, it can simply be used to approximate a probability distribution and then impute missing data.

A few sampling techniques: Ancestral Sampling, Inverse Transform Sampling, Rejection Sampling, Importance Sampling, Monte Carlo Sampling, MCMC Sampling.

In this post we focus solely on the MCMC method; the other techniques are a discussion for another post.

Introduction

MCMC methods are a family of algorithms that use Markov chains to perform Monte Carlo estimation.

The name gives us a hint that it is composed of two components: Monte Carlo and Markov Chain. Let us understand them separately and in their combined form.

(Intuitively)

The Monte Carlo method derives its name from the Monte Carlo Casino in Monaco.

It is a technique for sampling from a probability distribution and using those samples to approximate a desired quantity. In other words, it uses randomness to estimate some deterministic quantity of interest.

Example: if we are asked to calculate the area under the curve in the image below, doing so analytically might require integrating a complex formula. Using the Monte Carlo method instead, we randomly generate red dots in the bounding rectangle (more dots for more accuracy) and compute the ratio of dots falling under the curve to dots falling in the entire rectangle; multiplying this ratio by the area of the rectangle gives us the area under the curve.

[Figure: a curve inside a rectangle, with random red dots scattered above and below the curve]
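
To make the dots idea concrete, here is a minimal sketch in NumPy (the specific curve is a stand-in of mine, not from the original figure): we estimate the area under f(x) = sin(x) on [0, π] by throwing uniform random points into the bounding rectangle [0, π] × [0, 1]; the true area is 2, so the approximation is easy to check.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # more dots for more accuracy

# random "dots" in the bounding rectangle [0, pi] x [0, 1]
x = rng.uniform(0.0, np.pi, n)
y = rng.uniform(0.0, 1.0, n)

under_curve = y < np.sin(x)        # dots falling under the curve
rect_area = np.pi * 1.0            # area of the bounding rectangle
area_estimate = under_curve.mean() * rect_area

print(area_estimate)  # ~2.0, the true area under sin(x) on [0, pi]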

Basically, if the quantity we want to calculate has a complex analytical structure, we can simply run a simulation to generate lots of samples and use them to approximate the quantity. This works provided the samples are independent, so that the law of large numbers and the central limit theorem apply asymptotically.

It has many other use cases, e.g. in risk analysis and reliability analysis.

(Mathematically)

Say we have an expectation s to estimate; this could be a highly complex or even intractable integral. Using the Monte Carlo method, we resort to approximating such quantities by averaging over samples:

s = E[f(x)] = ∫ f(x) p(x) dx

ŝ = (1/n) · Σᵢ f(xᵢ), where x₁, …, xₙ are samples drawn from p(x)

Computing the average over a large number of samples reduces the standard error (which shrinks as 1/√n) and provides us with a fairly accurate approximation.
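
As a minimal sketch of this averaging (the choices of f and p are mine, purely for illustration): estimate s = E[f(x)] for f(x) = x² with x drawn from a standard normal; the true value is 1.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# draw samples from p(x) = Normal(0, 1) and average f(x) = x^2 over them
x = rng.normal(0.0, 1.0, n)
s_hat = np.mean(x ** 2)

print(s_hat)  # ~1.0, the true value of E[x^2] under a standard normal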

This method has a limitation: it assumes we can easily sample from the probability distribution, which is not always possible. In such cases, we make use of Markov chains to efficiently sample from an otherwise intractable probability distribution.

Before getting into Markov chains, let us look at the property that defines them:

Markov property :

[Diagram: four states, with 'Rain' and 'Car Wash' each pointing to 'Wet Ground', which points to 'Slip']

Consider the system of four states from the above diagram: 'Rain' or 'Car Wash' causes the 'Wet Ground', and the 'Wet Ground' in turn causes the 'Slip'.

The Markov property simply makes an assumption: the probability of jumping from one state to the next depends only on the current state, and not on the sequence of previous states that led to the current state.

If we want to calculate the probability of someone slipping, knowing whether the ground is wet provides enough evidence to estimate it. We do not need the states that led to it ('Rain' or 'Car Wash').

Mathematically speaking:

P(Xₙ₊₁ | Xₙ, Xₙ₋₁, …, X₁) = P(Xₙ₊₁ | Xₙ)

It is evident from the mathematical equation that the Markov property assumption could potentially save us a lot of computation.

In short, if a process exhibits the Markov property, it is known as a Markov chain.

Now that we have seen Markov chains, let us discuss the property that makes them so desirable: the stationary distribution.

Stationary Distribution :

Suppose we have a process with a few states and a fixed transition probability matrix Q for jumping between states. We start with some random probability distribution Sᵢ over all states at time step i; to estimate the probability distribution over all states at the next time step i+1, we multiply it by the transition matrix Q.

Sᵢ₊₁ = Sᵢ · Q

If we keep doing this, after a while S stops changing when multiplied by the matrix Q; this is when we say it has reached the stationary distribution.

S = S · Q

Let us see this with an example —

In this example, we have 3 states (S₁, S₂, S₃)

Q = [ 0    1    0   ]
    [ 0    0.1  0.9 ]
    [ 0.6  0.4  0   ]

(row i gives the probabilities of transitioning out of state Sᵢ, so each row sums to 1)

If we are in the state S₂, the probability of staying put in S₂ is 0.1,
transitioning to state S₁ is 0, and transitioning to state S₃ is 0.9 (as evident from the second row in the matrix).

Let us start with some random value of the vector Sᵢ (the vector gives the probability of being in each state at a particular time step); note how its entries sum up to 1.

Sᵢ = [0.3, 0.4, 0.3]

Sᵢ₊₁ = Sᵢ · Q = [0.18, 0.46, 0.36]

If we keep moving through time steps, we eventually reach the stationary state:

S ≈ [0.22, 0.41, 0.37]

Now, the best part: this stationary distribution does not depend on the initial state. You can try experimenting with various initial states Sᵢ.

import numpy as np

# transition matrix: row i holds the probabilities of moving out of state i
Q = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

# arbitrary initial distribution over the three states (sums to 1)
S = np.array([[0.3, 0.4, 0.3]])

epsilon = 1.0
while epsilon > 1e-8:
    S_next = S @ Q  # one step of the chain
    epsilon = np.sqrt(np.sum(np.square(S_next - S)))
    S = S_next

print(S)  # ~[[0.221, 0.410, 0.369]]

The stationary distribution gives the probability of being in any given state over the long run.
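
As a quick cross-check (not from the original post): the stationary distribution is the left eigenvector of Q with eigenvalue 1, so we can also obtain it directly from NumPy's eigendecomposition instead of iterating.

import numpy as np

Q = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

# left eigenvectors of Q are the right eigenvectors of Q.T
eigvals, eigvecs = np.linalg.eig(Q.T)
idx = np.argmin(np.abs(eigvals - 1.0))  # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                      # normalize to a probability vector

print(pi)  # ~[0.221, 0.410, 0.369], matching the iterative result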

Intuitively speaking, a Markov chain can be thought of as walking on the chain: given the state at a particular step, we decide on the next state by looking at the probability distribution over states at the next step.

Well, now that we have seen both Markov chains and the Monte Carlo method, let us put our focus on their combined form: MCMC.

Extras —

Anyone who is curious as to why a Markov chain converges to a stationary distribution at all: for a finite chain that is irreducible (every state can reach every other state) and aperiodic, the transition matrix Q has a unique eigenvalue equal to 1, with every other eigenvalue smaller in magnitude; repeated multiplication by Q damps out all other components of the initial distribution, leaving only the eigenvector with eigenvalue 1, which is the stationary distribution (essentially the Perron-Frobenius theorem).

Markov Chain Monte Carlo (MCMC)

MCMC can be used to sample from any probability distribution. Mostly, we use it to sample from an intractable posterior distribution for the purpose of inference.

Estimating the posterior using Bayes' rule can be difficult: in most cases we can find the functional form of likelihood × prior, but computing the marginal probability P(B) can be computationally expensive, especially when it involves integrating over a continuous distribution.

P(A | B) = P(B | A) · P(A) / P(B)   (posterior = likelihood × prior / evidence)

The trick here is to avoid calculating the normalization constant altogether.

The general idea of the algorithm is to start with some random probability distribution and gradually move towards the desired probability distribution.

Sounds easy enough, but how do we do this?

Initiate a Markov chain with a random probability distribution over states, gradually move along the chain converging towards the stationary distribution, and apply a condition (detailed balance) that ensures this stationary distribution resembles the desired probability distribution.

Thus, on reaching the stationary distribution, we have approximated the posterior probability distribution.

p(A) · T(A → B) = p(B) · T(B → A)

The probability p(A) represents the probability of being at A, and T(A → B) the probability of moving from A to B; likewise, p(B) represents the probability of being at B, and T(B → A) the probability of moving from B to A.

Each side represents the probability flow, either from A to B or from B to A. If the condition is satisfied, the stationary state is guaranteed to approximately represent the desired posterior distribution.

Although MCMC methods are themselves complicated, they provide a lot of flexibility: they give us efficient sampling in high dimensions and can be used to solve problems with a large state space.

Limitation: MCMC does not perform well at approximating probability distributions with multiple modes.

Extras —

The marginal probability P(B) in this case is a constant, known as the normalization constant, obtained by summing (or integrating) the numerator over all possible values.

There are techniques for training or evaluating models that have an intractable normalization constant (also known as the partition function). A few of them use MCMC for sampling within their algorithms.

e.g., Contrastive Divergence (CD) is useful for training undirected graphical models like the Restricted Boltzmann Machine.

Metropolis-Hastings Algorithm

Suppose we are sampling from a distribution p(x) = f(x) / Z, where Z is the intractable normalization constant.

Our objective is to sample from p(x) in a way that makes use of the numerator f(x) alone and avoids having to estimate the denominator Z.

(Proposal Probability)

Let us start by looking at the proposal probability g. Given a sample, it proposes the next potential sample in the Markov chain. (How to decide whether to accept or reject the potential sample is something we will see in the next section.)


Assume that g(X₂ | X₁) = Normal(X₁, σ) (it could have been any distribution; for simplicity we choose the normal distribution).

Keeping X₁ as the mean, we form a normal distribution and sample X₂ from it.

We repeat the same step to sample X₃, keeping X₂ as the mean, and so on.
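
A tiny sketch of this random-walk proposal (σ here is a tuning parameter chosen for illustration):

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                   # proposal width, a tuning parameter

x1 = 0.0                      # current sample
x2 = rng.normal(x1, sigma)    # propose X2 ~ Normal(X1, sigma)
x3 = rng.normal(x2, sigma)    # then X3 ~ Normal(X2, sigma), and so on

print(x2, x3)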

(Main Algorithm)

Let us begin the algorithm by writing down the detailed balance condition.

p(X₁) · T(X₁ → X₂) = p(X₂) · T(X₂ → X₁)

The probability of transitioning from X₁ to X₂ can be seen as a two-step process, given that we are at state X₁:

The first step is to propose a new state X₂ with some proposal probability g (discussed in the previous section).

The second step is to accept the new state X₂ with some acceptance probability A.

In other words, T(X₁ → X₂) = g(X₂ | X₁) · A(X₁ → X₂).

Substituting the transition probability into the equation…

p(X₁) · g(X₂ | X₁) · A(X₁ → X₂) = p(X₂) · g(X₁ | X₂) · A(X₂ → X₁)

Substituting p(x) = f(x)/Z, the Z appearing on both sides cancels out, and we get…

f(X₁) · g(X₂ | X₁) · A(X₁ → X₂) = f(X₂) · g(X₁ | X₂) · A(X₂ → X₁)

Restructuring the equation leads to

A(X₁ → X₂) / A(X₂ → X₁) = [f(X₂) · g(X₁ | X₂)] / [f(X₁) · g(X₂ | X₁)]

Introducing the short-hand notation

r = [f(X₂) · g(X₁ | X₂)] / [f(X₁) · g(X₂ | X₁)]

the condition becomes A(X₁ → X₂) / A(X₂ → X₁) = r.

Finally, we get the acceptance probability A:

A(X₁ → X₂) = min(1, r)

This choice satisfies the condition above: when r ≥ 1 the forward move is always accepted while the reverse move is accepted with probability 1/r, giving the required ratio r; the case r < 1 works symmetrically.

Summarize —

  • We start with a random state.
  • Based on the proposal probability g, we randomly pick a candidate state.
  • We calculate the acceptance probability A of the proposed state.
  • We flip a coin whose probability of landing heads equals the acceptance probability; if it lands heads, we accept the sample, otherwise we reject it.
  • We repeat the process for quite some time.

We keep sampling for a long time and discard the first few samples, since the chain has not yet reached its stationary state (this period is known as the burn-in period).

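Putting the whole recipe together, here is a minimal Metropolis-Hastings sketch under stated assumptions: the unnormalized target is f(x) ∝ exp(-x²/2) (a standard normal with its constant Z dropped) and the proposal is a symmetric normal, in which case the g terms cancel and the acceptance probability reduces to min(1, f(x_new) / f(x_current)).

import numpy as np

def f(x):
    # unnormalized target density: a standard normal with its constant Z dropped
    return np.exp(-0.5 * x ** 2)

def metropolis_hastings(n_steps=50_000, sigma=1.0, burn_in=1_000, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0                                   # arbitrary starting state
    samples = []
    for _ in range(n_steps):
        x_new = rng.normal(x, sigma)          # propose from g(x_new | x)
        # symmetric proposal, so A = min(1, f(x_new) / f(x))
        accept_prob = min(1.0, f(x_new) / f(x))
        if rng.random() < accept_prob:        # the coin flip
            x = x_new                         # accept; otherwise stay put
        samples.append(x)
    return np.array(samples[burn_in:])        # discard the burn-in period

samples = metropolis_hastings()
print(samples.mean(), samples.std())  # ~0 and ~1 for a standard normal

Since the chain keeps the current sample on rejection, repeated values are expected; they are part of the correct sampling behavior.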

Limitation :

  • The chain can get stuck while approximating multi-modal distributions, yielding biased samples and therefore a less accurate estimate of the desired quantity.
  • Metropolis-Hastings becomes very slow when the sample space is high-dimensional. (Another amazing MCMC method, Hamiltonian Monte Carlo, overcomes these shortcomings and is a discussion for another post.)

Conclusion :

This concludes our discussion of the MCMC method. More posts on similar topics are to follow.

It is by no means an easy concept to understand, precisely because it is an amalgamation of concepts, and those concepts are concealed behind mathematics.

If you have made it all the way, kudos to you!! All the best!!
