Showing posts with label bayesian analysis. Show all posts
Showing posts with label bayesian analysis. Show all posts

April 20, 2013

The Metropolis algorithm applied to Bayesian analysis

As the dimensionality of a Bayesian model increases, numerically integrating the parameter space becomes computationally intensive and, in all but the most premeditated cases, analytic computation is unfeasible. An alternative is to sample the posterior distribution
$$p(\theta|D) = \frac{p(D|\theta) p(\theta)}{\int p(D|\theta)p(\theta) d\theta}$$
via Monte-Carlo. The Metropolis algorithm, first discussed in Metropolis, Rosenblush, Teller (1953), allows you to sample from any distribution $p(x)$ providing you can compute a function $Cp(x)$ proportional to $p(x)$. In this case, the algorithm allows us to sample the posterior without performing the integral in the denominator.

The algorithm

We start by defining a transition probability
$$M(x,x') = q(x,x')\min\left(\frac{p(x')}{p(x)}, 1\right)$$
where $q$ is any arbitrary transition probability. In practice, to create a chain of samples based on $M$ we
  1. Sample $q(x, \cdot)$ to choose $x'$
  2. Move to $x'$ if $p(x') > p(x)$, else move to $x'$ with probability $p(x')/p(x)$, else remain in $x$.
As the number of iterations increases, the probability of being in state $x$ converges to $p(x)$. Intuitively, this happens because $p$ is an eigenfunction of $M$ with eigenvalue $1$. Further details can be found in Robert, Casella (2004).

Practical issues

Although the algorithm works in principle with any non-degenerate proposal distribution $q$, the optimal choice minimizes the burn-in period, i.e. the number of initial iterations we discard before considering the chain to have 'converged' to the target distribution.

April 17, 2013

The simplest possible Bayesian model

In chapter 5, Kruschke goes through the process of using Bayes' rule for updating belief in the 'lopsidedness' of a coin. Instead of using R I decided to try implementing the model in Excel.

The Model

Coin tosses are IID with a $\mathrm{Bernoulli}(\theta)$ distribution, thus the probability of observing $N$ of which $z$ are heads is
$$p(z|\theta, N) = \theta^z (1-\theta)^{(N-z)}$$
Using Bayes' rule we get
$$p(\theta|z,N) = \frac{\theta^z (1-\theta)^{N-z} p(\theta)}{p(z|N)}$$
Note that if $p(\theta)$ is already of the form
$$f(\theta;a,b) = \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a,b)}$$
where the denominator is a normalizing constant then $p(\theta|N) = f(\theta;a+z,b+N-z)$.

Excel Implementation

This is very straight-forward using the BETADIST function, which is cumulative, so we divide the unit interval into equally spaced subintervals, calculate the probability of $\theta$ being in the interval and update on the basis of the new observations.

Here's the result:


Discrete model

Using a discrete distribution for $\theta$ rather than a continuous parametric form we can allow for much more freedom in the choice of initial prior distribution. The update rule becomes
$$p(\theta|z_N,N)=\frac{p(z_N|z_{N-1},N-1,\theta)p(\theta|z_{N-1},N-1)}{\sum_{i=1}^n p(z_N|z_{N-1},N-1,\theta_i) p(\theta_i|z_{N-1},N-1)}$$
in other words
$$p_{N,j} = \frac{\tilde p_{N,j}}{\sum_{i=1}^n \tilde p_{N,i}}$$
where
$$\tilde p_{N,k} = (\theta_k y_N + (1-\theta_k)(1-y_N)) p_{N-1,j}$$
and $y_N = z_N - z_{N-1}$ is the result of the $N$th coin toss.

As expected the result is very similar to the continuous model.

In the discrete model we can assign any initial prior. Here is an example with an initial prior that assigns non-zero probability to only 3 values of $\theta$ (22.5%, 52.5% and 77.5%) and random tosses generated with a real theta of 20%. The distribution quickly converges to probability 1 that $\theta=22.5\%$.


April 12, 2013

Doing Bayesian Data Analysis

Assuming I can keep at it, I'll be making my way through Kruschke's Doing Bayesian Data Analysis. Here's a few concepts he goes through in Chapter 4.

The Bayes factor

This is a ratio which allows you to compare which out of two models best fits the data. By introducing a binary parameter which determines the choice of model, Bayes' rule
$$p(M_i|D)=\frac{p(D|M_i)p(M_i)}{p(D)}$$
gives us the Bayes factor
$$\frac{p(M_1|D)}{p(M_2|D)}=\frac{p(D|M_1)}{p(D|M_2)}\frac{p(M_1)}{p(M_2)}$$
or also
$$\frac{p(D|M_1)}{p(D|M_2)}=\frac{p(M_1|D)}{p(M_2|D)}\frac{p(M_2)}{p(M_1)}$$
Note that if we have no prior preference between the two models, these ratios are equal. The Bayes' factor is useful because on their own $p(D|M_i)$ scale on the basis of the model - the more complex the model, the lower the absolute value - but this is actually a good thing because by considering the ratio we automatically trade off complexity (when there's little data) against descriptive value (when the simpler model doesn't fit the data).

Data order invariance

I don't get this part - it's always true that $p(\theta|D', D)$ is invariant of the order in which Bayes' rule is applied, by definition. The factorized rule cited at the end is just the result of applying the definition of data independence given the parameters.

The problem with Bayesian mathematics

Computing the posterior given some new data usually means performing an integral, which, even using approximations, can be computationally intensive (especially since the posterior will be fed into an optimizer). Numerical integration on the other hand would impose a limit on the dimensionality of the model. Thus we winning method is Markov Chain Monte Carlo (MCMC).