Here we will explore the relationship between maximizing the likelihood p.d.f. $p(\vec{t}\mid\vec{w})$, maximizing the posterior p.d.f. $p(\vec{w}\mid\vec{t})$, minimizing the sum-of-squares error function $E_D(\vec{w})$, and the regularization technique.
When we maximize the posterior probability density function with respect to the parameter vector using the "Bayesian technique", we need both the likelihood function and the prior. (The denominator in Bayes' theorem is just a normalization constant, so it does not affect the maximization.)

$$p(\vec{w}\mid\vec{t}) \propto p(\vec{t}\mid\vec{w})\,p(\vec{w})$$
The model
We have a set of inputs $\vec{x} = [x_1, \ldots, x_n]^T$ with corresponding target values $\vec{t} = [t_1, \ldots, t_n]^T$.
We assume that there exists some deterministic function $y$ such that each target can be modelled as $y(x_i, \vec{w})$ plus additive Gaussian noise,

$$t_i = y(x_i, \vec{w}) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \beta^{-1})$$

Here $\beta$ is the precision (inverse variance) of the additive univariate Gaussian noise.
We define $y$ as a linear combination of basis functions,

$$y(x_i, \vec{w}) = w_1\phi_1(x_i) + w_2\phi_2(x_i) + \cdots + w_p\phi_p(x_i) = \vec{w}^T\vec{\phi}(x_i)$$

We define the parameter vector as $\vec{w}_{p\times 1} = [w_1, w_2, \ldots, w_p]^T$ and the basis vector as $\vec{\phi}(x_i) = [\phi_1(x_i), \phi_2(x_i), \ldots, \phi_p(x_i)]^T$.
This parameter vector is central to everything that follows: the posterior p.d.f. is the updated probability of $\vec{w}$ given some training data, obtained by combining the prior over $\vec{w}$ with the likelihood p.d.f. of observing that training data given $\vec{w}$.

We usually choose $\phi_1(x) = 1$ because we need a bias term in the model (it controls a fixed shift in $y$ itself; check this answer out).
For the data set as a whole we can write the set of model outputs as a vector $\vec{y}_{n\times 1}$,

$$\vec{y}(\vec{x}, \vec{w}) = \Phi\vec{w}$$

Here the basis matrix $\Phi_{n\times p}$ is a function of $\vec{x}$ and is defined with its $i$-th row equal to $[\phi_1(x_i), \phi_2(x_i), \ldots, \phi_p(x_i)]$, for $n$ such rows.
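To make the shapes concrete, here is a minimal sketch in Python, assuming polynomial basis functions $\phi_j(x) = x^{j-1}$ (the model itself does not fix the basis; any other choice works the same way):

```python
import numpy as np

def design_matrix(x, p):
    """Build the n x p basis matrix Phi.

    Row i is [phi_1(x_i), ..., phi_p(x_i)]; here we assume polynomial
    basis functions phi_j(x) = x**(j - 1), so the first column is the
    constant 1, i.e. the bias term phi_1(x) = 1.
    """
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**j for j in range(p)])  # shape (n, p)

# toy data: n = 5 inputs, p = 3 basis functions (made-up values)
x = np.linspace(0.0, 1.0, 5)
w = np.array([0.5, -1.0, 2.0])          # parameter vector, shape (p,)
Phi = design_matrix(x, p=3)             # basis matrix, shape (n, p)
y = Phi @ w                             # model outputs, y_i = w^T phi(x_i)
print(Phi.shape, y.shape)               # (5, 3) (5,)
```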
Likelihood function
Since we assume that the data points $(x_i, t_i)$ are drawn independently from this distribution, the likelihood is the product of the individual data points' p.d.f.s, each of which is Gaussian,

$$p(\vec{t}\mid\vec{x}, \vec{w}, \beta) = \prod_{i=1}^{n}\mathcal{N}\bigl(t_i \mid \vec{w}^T\vec{\phi}(x_i),\, \beta^{-1}\bigr)$$

Note that the $i$-th data point's p.d.f. is centred on $\vec{w}^T\vec{\phi}(x_i)$ as its mean.
Does the product of $n$ univariate Gaussians form a multivariate distribution in $\{t_i\}$? It does: because the $t_i$ are independent, the product is a multivariate Gaussian in $(t_1, \ldots, t_n)$ with mean $\Phi\vec{w}$ and diagonal covariance $\beta^{-1}I$. Just as importantly, viewed as a function of $\vec{w}$ the likelihood is the exponential of a quadratic form, which is what makes a Gaussian prior conjugate.
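A quick numerical check of that claim, with assumed toy values for the basis matrix, $\vec{w}$ and $\beta$: the sum of the $n$ univariate Gaussian log-densities matches the log-density of a single multivariate Gaussian with mean $\Phi\vec{w}$ and covariance $\beta^{-1}I$.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
n, p, beta = 5, 3, 25.0                          # assumed sizes and noise precision
x = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([x**j for j in range(p)])  # polynomial basis, bias column first
w = np.array([0.5, -1.0, 2.0])
t = Phi @ w + rng.normal(0.0, beta**-0.5, size=n)

# likelihood as a product of n univariate Gaussians (sum of log-pdfs) ...
loglik_prod = norm.logpdf(t, loc=Phi @ w, scale=beta**-0.5).sum()
# ... equals one multivariate Gaussian in (t_1, ..., t_n) with covariance beta^{-1} I
loglik_mvn = multivariate_normal.logpdf(t, mean=Phi @ w, cov=np.eye(n) / beta)
print(np.isclose(loglik_prod, loglik_mvn))       # True
```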
Prior
Since the likelihood function (the product of the Gaussians above) is the exponential of a quadratic function of $\vec{w}$, we choose the corresponding conjugate prior, which is also Gaussian.
Thus the prior p.d.f. is a normal distribution, $p(\vec{w}) = \mathcal{N}(\vec{w}\mid m_0, S_0)$.
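One way to build intuition for this prior is to sample parameter vectors from it and look at the corresponding model outputs $\Phi\vec{w}$; a small sketch with assumed values for $m_0$ and $S_0$ (and the same toy polynomial basis as above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 3
m0 = np.zeros(p)                 # prior mean, p x 1
S0 = 2.0 * np.eye(p)             # prior covariance, p x p (assumed value)

x = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([x**j for j in range(p)])

# each draw of w from the prior corresponds to one candidate function y = Phi w
for w_sample in rng.multivariate_normal(m0, S0, size=3):
    print(np.round(Phi @ w_sample, 2))
```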
Posterior
The posterior p.d.f. is also a Gaussian, $\mathcal{N}(m_N, S_N)$ (as we chose a conjugate prior).

After solving for $m_N$ and $S_N$ we get,

(The complete derivation is available in Bishop, eq. (2.116) - coming soon)
$$m_N = S_N\bigl(S_0^{-1}m_0 + \beta\,\Phi^T\vec{t}\bigr)$$

$$S_N^{-1} = S_0^{-1} + \beta\,\Phi^T\Phi$$
The sizes are:

The mean vectors $m_N$ and $m_0$ are both $p\times 1$; they can be thought of as the optimal parameter vector and the pseudo-observations respectively.

The covariance matrices $S_N$ and $S_0$ are both $p\times p$.
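Until the full derivation is written up, here is the shape of the argument (Bishop's completing-the-square step): add the log-likelihood to the log-prior, collect the terms that are quadratic and linear in $\vec{w}$, and match them against the exponent of $\mathcal{N}(\vec{w}\mid m_N, S_N)$.

$$\begin{aligned}
\ln p(\vec{w}\mid\vec{t})
  &= -\tfrac{\beta}{2}(\vec{t}-\Phi\vec{w})^T(\vec{t}-\Phi\vec{w})
     -\tfrac{1}{2}(\vec{w}-m_0)^T S_0^{-1}(\vec{w}-m_0) + \text{const} \\
  &= -\tfrac{1}{2}\vec{w}^T\bigl(S_0^{-1}+\beta\,\Phi^T\Phi\bigr)\vec{w}
     +\vec{w}^T\bigl(S_0^{-1}m_0+\beta\,\Phi^T\vec{t}\bigr) + \text{const}
\end{aligned}$$

Comparing the quadratic term with $-\tfrac{1}{2}(\vec{w}-m_N)^T S_N^{-1}(\vec{w}-m_N)$ gives $S_N^{-1} = S_0^{-1} + \beta\,\Phi^T\Phi$, and matching the linear term gives $S_N^{-1}m_N = S_0^{-1}m_0 + \beta\,\Phi^T\vec{t}$, i.e. exactly the two formulas above.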
We shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we assume a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$,

$$p(\vec{w}) = \mathcal{N}(\vec{w}\mid\vec{0},\, \alpha^{-1}I)$$

So we basically take $m_0 = \vec{0}$ and $S_0 = \alpha^{-1}I_{p\times p}$.
Thus, if we use this prior, the mean vector and the covariance matrix of the posterior p.d.f. simplify to,

$$m_N = \beta\, S_N \Phi^T\vec{t}$$

$$S_N^{-1} = \alpha I + \beta\,\Phi^T\Phi$$
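A minimal numerical sketch of these two formulas, with assumed toy values for the basis, $\alpha$ and $\beta$ (none of these numbers come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
alpha, beta = 2.0, 25.0                          # assumed prior / noise precisions

x = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([x**j for j in range(p)])  # n x p design matrix
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true + rng.normal(0.0, beta**-0.5, size=n)

# posterior covariance and mean for the zero-mean isotropic prior
S_N = np.linalg.inv(alpha * np.eye(p) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

print(np.round(m_N, 3))                          # posterior mean, close to w_true
```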
Now if we take the log of the posterior p.d.f. $\mathcal{N}(m_N, S_N)$ in order to maximize it with respect to $\vec{w}$, we find that this is equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to $\lambda = \alpha/\beta$.

$$\ln p(\vec{w}\mid\vec{t}) = -\frac{\beta}{2}\sum_{i=1}^{n}\bigl(t_i - \vec{w}^T\vec{\phi}(x_i)\bigr)^2 - \frac{\alpha}{2}\vec{w}^T\vec{w} + \text{const} = -\beta E_D(\vec{w}) - \frac{\alpha}{2}\vec{w}^T\vec{w} + \text{const}$$

Here $E_D(\vec{w}) = \frac{1}{2}\sum_{i=1}^{n}\bigl(t_i - \vec{w}^T\vec{\phi}(x_i)\bigr)^2$ is the sum-of-squares error function, so (dividing through by $\beta$) maximizing the posterior is the same as minimizing $E_D(\vec{w}) + \frac{\lambda}{2}\vec{w}^T\vec{w}$ with $\lambda = \alpha/\beta$.
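A quick check that the posterior mean $m_N$ (the MAP estimate) coincides with the minimizer of this regularized error, i.e. the ridge-regression solution $(\Phi^T\Phi + \lambda I)^{-1}\Phi^T\vec{t}$ with $\lambda = \alpha/\beta$, again with assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
alpha, beta = 2.0, 25.0
lam = alpha / beta                               # regularization strength lambda

x = np.linspace(0.0, 1.0, n)
Phi = np.column_stack([x**j for j in range(p)])
t = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, beta**-0.5, size=n)

# MAP estimate: posterior mean m_N = beta * S_N * Phi^T t
S_N = np.linalg.inv(alpha * np.eye(p) + beta * Phi.T @ Phi)
w_map = beta * S_N @ Phi.T @ t

# minimizer of E_D(w) + (lam / 2) * w^T w, i.e. the ridge solution
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ t)

print(np.allclose(w_map, w_ridge))               # True
```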
Thus we conclude that while maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function, maximizing the posterior p.d.f. is equivalent to minimizing the regularized sum-of-squares error function.
The regularization technique is used to control the over-fitting phenomenon by adding a penalty term to the error function in order to discourage the coefficients from reaching large values.
This penalty term arises naturally when we maximize the posterior p.d.f. w.r.t. $\vec{w}$.
The minimization of the sum-of-squares error function $E_D(\vec{w})$ is in turn the same as the maximization of the likelihood p.d.f. Taking the log of the likelihood $p(\vec{t}\mid\vec{w},\beta)$ we get,

$$\ln p(\vec{t}\mid\vec{w},\beta) = \frac{n}{2}\ln\beta - \frac{n}{2}\ln(2\pi) - \beta E_D(\vec{w})$$

Thus maximizing the likelihood is equal to maximizing $-E_D(\vec{w})$, i.e. minimizing $E_D(\vec{w})$ (the remaining terms are constant w.r.t. $\vec{w}$).
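Making the last step explicit: setting the gradient of the log-likelihood with respect to $\vec{w}$ to zero (the $\beta$ and $2\pi$ terms drop out) gives the ordinary least-squares solution, assuming $\Phi^T\Phi$ is invertible,

$$\nabla_{\vec{w}}\ln p(\vec{t}\mid\vec{w},\beta) = -\beta\,\nabla_{\vec{w}}E_D(\vec{w}) = \beta\,\Phi^T\bigl(\vec{t}-\Phi\vec{w}\bigr) = \vec{0} \quad\Rightarrow\quad \vec{w}_{\mathrm{ML}} = \bigl(\Phi^T\Phi\bigr)^{-1}\Phi^T\vec{t}$$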