
Saturday, March 12, 2016

Bayesian Technique

Here we will explore the relationship between maximizing the likelihood p.d.f p(t|w), maximizing the posterior p.d.f p(w|t), minimizing the sum-of-squares error function E_D(w), and the regularization technique.

When we maximize the posterior probability density function w.r.t. the parameter vector using the Bayesian technique, we need both the likelihood function and the prior. (The denominator in Bayes' theorem is just a normalization constant, so that doesn't really matter.)

$$p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{t}|\mathbf{w})\,p(\mathbf{w})$$

The model

We have a set of inputs x = [x_1, ..., x_n]^T with corresponding target values t = [t_1, ..., t_n]^T.

We assume that there exists some deterministic function y such that we can model the relationship between the two as y(x, w) plus additive Gaussian noise,

$$t_i = y(x_i, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \beta^{-1})$$

β is the precision (inverse variance) of the additive univariate Gaussian noise.

We define y as the linear combination of basis functions,
$$y(x_i, \mathbf{w}) = w_1\phi_1(x_i) + w_2\phi_2(x_i) + \ldots + w_p\phi_p(x_i) = \mathbf{w}^T\boldsymbol{\phi}(x_i)$$
We define the parameter vector as w_{p×1} = [w_1, w_2, ..., w_p]^T and the basis vector as ϕ(x_i) = [ϕ_1(x_i), ϕ_2(x_i), ..., ϕ_p(x_i)]^T.

This parameter vector is central: the posterior p.d.f is the updated probability of w given some training data, and it is obtained by combining the prior over w with the likelihood of observing that training data given w.

We usually choose ϕ_1(x) = 1 because we need a bias term in the model (to control the extent of the shift in y itself - check this answer out).

For the data set as a whole we can write the set of model outputs as a vector y_{n×1},

$$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \Phi\mathbf{w}$$

Here the basis matrix Φ_{n×p} is a function of x and is defined with its ith row being [ϕ_1(x_i), ϕ_2(x_i), ..., ϕ_p(x_i)], for n such rows.
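
To make this concrete, here is a minimal NumPy sketch of the setup above. The sinusoidal target, the noise level and the polynomial basis are illustrative assumptions on my part, not something fixed by the model itself; the point is just how the n×p basis matrix Φ (with a constant first column playing the role of the bias term ϕ_1(x) = 1) is assembled.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 25, 25.0                      # beta is the noise precision (variance = 1/beta)

# Synthetic data: t_i = y(x_i) + eps, eps ~ N(0, beta^{-1}); the sin target is illustrative
x = rng.uniform(0.0, 1.0, size=n)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(1.0 / beta), size=n)

def design_matrix(x, degree=3):
    """Polynomial basis: phi_1(x) = 1 (bias), phi_j(x) = x^(j-1)."""
    return np.vstack([x**j for j in range(degree + 1)]).T   # shape (n, p)

Phi = design_matrix(x)                  # n x p basis matrix; i-th row is phi(x_i)^T
```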

Likelihood function

We assume that the data points (x_i, t_i) are drawn independently from this distribution, so to get the likelihood of the whole data set we multiply the individual data points' p.d.f.s, which are Gaussian.

$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = \prod_{i=1}^{n}\mathcal{N}\!\left(t_i \,\middle|\, \mathbf{w}^T\boldsymbol{\phi}(x_i),\, \beta^{-1}\right)$$

Note that the ith data point's p.d.f is centered around w^T ϕ(x_i) as its mean.
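
A small sketch of this likelihood, assuming Phi, t and beta from the previous snippet: because the points are independent, the log-likelihood is just a sum of univariate Gaussian log-densities, each centred on w^T ϕ(x_i).

```python
import numpy as np

def log_likelihood(w, Phi, t, beta):
    """Log of prod_i N(t_i | w^T phi(x_i), beta^{-1}) for the whole data set."""
    resid = t - Phi @ w                 # t_i - w^T phi(x_i) for every i
    n = len(t)
    return 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi) - 0.5 * beta * np.sum(resid**2)
```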

Does the product of n univariate Gaussians form a multivariate distribution in {t_i}? I ask this because we choose a Gaussian prior, so the likelihood should also be Gaussian, right?

Prior

We choose the corresponding conjugate prior, since we have a likelihood function which is the exponential of a quadratic function of w.

The answer to the question above is yes: the product of all those independent univariate Gaussians is a multivariate Gaussian in t (with diagonal covariance β^{-1}I), and viewed as a function of w it is the exponential of a quadratic, which is exactly why a Gaussian prior is conjugate here.

Thus the prior p.d.f is a normal distribution, N(m_0, S_0).

Posterior

The posterior p.d.f is N(m_N, S_N) (as we chose a conjugate prior).

After solving for m_N and S_N we get,

(The complete derivation is available in Bishop - (2.116)) - coming soon

$$\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\Phi^T\mathbf{t}\right)$$
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\Phi^T\Phi$$

The sizes are:
The mean vectors m_N and m_0 are both p×1, and they can be thought of as the optimal parameter vector and the pseudo-observations respectively.
The covariance matrices S_N and S_0 are both p×p.

We shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we assume a zero-mean isotropic Gaussian governed by a single precision parameter α,
$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}\,|\,\mathbf{0}, \alpha^{-1}\mathbf{I})$$
So we basically take m_0 = 0 and S_0 = α^{-1} I_{p×p}.

Thus if we use this prior we can simplify the mean vector and the covariance matrix of posterior p.d.f to,

$$\mathbf{m}_N = \beta\mathbf{S}_N\Phi^T\mathbf{t}$$
$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\Phi^T\Phi$$
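
A sketch of this posterior update, again assuming Phi, t and beta from the earlier snippet; the value of α here is an arbitrary illustrative choice.

```python
import numpy as np

alpha = 2.0                                          # prior precision (illustrative value)
p = Phi.shape[1]

S_N_inv = alpha * np.eye(p) + beta * Phi.T @ Phi     # S_N^{-1} = alpha I + beta Phi^T Phi
S_N = np.linalg.inv(S_N_inv)                         # posterior covariance, p x p
m_N = beta * S_N @ Phi.T @ t                         # posterior mean m_N = beta S_N Phi^T t
```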

Now if we take the log of the posterior p.d.f N(m_N, S_N), in order to maximize it with respect to w, we find that what we obtain is equivalent to minimizing the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to λ = α/β.

$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{\beta}{2}\sum_{i=1}^{n}\left(t_i - \mathbf{w}^T\boldsymbol{\phi}(x_i)\right)^2 - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const} = -\beta E_D(\mathbf{w}) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const}$$

Thus we conclude that while maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function, maximizing the posterior p.d.f is equivalent to the regularization technique.
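
We can check this equivalence numerically. Assuming the quantities from the previous sketches, the posterior mean m_N should coincide with the regularized least-squares (ridge) solution with λ = α/β:

```python
import numpy as np

# MAP estimate vs. regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_ridge, m_N))   # True: the MAP estimate is the regularized solution
```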

The regularization technique is used to control the over-fitting phenomenon by adding a penalty term to the error function in order to discourage the coefficients from reaching large values.

This penalty term arises naturally when we maximize posterior p.d.f w.r.t w

Here, minimizing the sum-of-squares error function E_D(w) is also the same as maximizing the likelihood p.d.f. Taking the log of p(t|w,β) we get,

$$\ln p(\mathbf{t}|\mathbf{w},\beta) = \frac{n}{2}\ln\beta - \frac{n}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$$

Thus maximizing the likelihood is equivalent to minimizing E_D(w) (the remaining terms are constant w.r.t w).
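
Again as a numerical check, assuming Phi, t, beta, m_N and the log_likelihood helper from the earlier sketches: the maximum-likelihood weights are just the ordinary least-squares weights, and they attain at least as high a log-likelihood as the MAP estimate m_N.

```python
import numpy as np

# Ordinary least squares minimizes E_D(w), so it is the maximum-likelihood solution
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# The ML weights cannot have a lower log-likelihood than the regularized MAP weights
print(log_likelihood(w_ml, Phi, t, beta) >= log_likelihood(m_N, Phi, t, beta))  # True
```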
