Here we will explore the relationship between maximizing the likelihood p.d.f $p(\mathbf{t} \mid \mathbf{w})$, maximizing the posterior p.d.f $p(\mathbf{w} \mid \mathbf{t})$, minimizing the sum-of-squares error function $E_D(\mathbf{w})$, and the regularization technique.
When we maximise the posterior probability density function $p(\mathbf{w} \mid \mathbf{t})$ with respect to the parameter vector $\mathbf{w}$ using Bayes' theorem, we need both the likelihood function and the prior. (The denominator in Bayes' theorem is just a normalization constant, so it doesn't really matter.)
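Written out for this setting, Bayes' theorem reads

$$p(\mathbf{w} \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{t})} \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w}),$$

since the denominator $p(\mathbf{t})$ does not depend on $\mathbf{w}$.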
The model
We have a set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, with corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$.
We assume that there exists some deterministic function $y(\mathbf{x}, \mathbf{w})$ such that we can model the relationship between these two as the sum of $y(\mathbf{x}, \mathbf{w})$ and additive Gaussian noise,

$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \beta^{-1}).$$

$\beta$ is the precision (inverse variance) of the additive univariate Gaussian noise.
We define $y(\mathbf{x}, \mathbf{w})$ as a linear combination of basis functions,

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}).$$

We define the parameter vector as $\mathbf{w} = (w_0, \ldots, w_{M-1})^{\mathrm{T}}$ and the basis vector as $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^{\mathrm{T}}$.
This parameter vector is very important: the posterior p.d.f $p(\mathbf{w} \mid \mathbf{t})$ is the updated probability of $\mathbf{w}$ given some training data, and it is found from the prior p.d.f of $\mathbf{w}$, while the likelihood is the p.d.f of getting that training data given $\mathbf{w}$.
We usually choose $\phi_0(\mathbf{x}) = 1$ because we need a bias term $w_0$ in the model (to control the extent of a fixed shift in $y(\mathbf{x}, \mathbf{w})$ itself - check this answer out).
For the data set as a whole we can write the set of model outputs as a vector $\mathbf{y}$,

$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{w}.$$

Here the basis matrix $\boldsymbol{\Phi}$ is a function of the inputs $\{\mathbf{x}_n\}$ and is defined with its $n$-th row being $\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} = \left(\phi_0(\mathbf{x}_n), \ldots, \phi_{M-1}(\mathbf{x}_n)\right)$, with $N$ such rows, so $\boldsymbol{\Phi}$ is $N \times M$.
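As a concrete illustration (not part of the derivation), here is a minimal sketch of building such a basis matrix in NumPy, assuming scalar inputs and polynomial basis functions $\phi_j(x) = x^j$ - the post itself does not commit to any particular basis:

```python
import numpy as np

def design_matrix(x, M):
    """Build the N x M basis matrix Phi for scalar inputs x, assuming
    polynomial basis functions phi_j(x) = x**j (so phi_0(x) = 1 is the bias)."""
    x = np.asarray(x, dtype=float)
    # Column j holds phi_j evaluated at every input, so row n is phi(x_n)^T.
    return np.vstack([x**j for j in range(M)]).T

# Example: N = 5 inputs and M = 3 basis functions -> Phi has shape (5, 3).
Phi = design_matrix([0.0, 0.25, 0.5, 0.75, 1.0], M=3)
print(Phi.shape)  # (5, 3)
```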
Likelihood function
Since we assume that these data points are drawn independently from the distribution, we have to multiply the individual data points' p.d.f.s - which are Gaussian - to get the likelihood function,

$$p(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1}\right).$$

Note that the $n$-th data point's p.d.f is centered around $y(\mathbf{x}_n, \mathbf{w}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)$ as the mean.
Does the product of univariate Gaussians form a multivariate distribution in $\{t_n\}$? I ask this because we choose a Gaussian prior, so the likelihood should also be Gaussian, right?
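A quick way to see this, using only the independence assumption above: the product of the $N$ univariate Gaussians is a single multivariate Gaussian over $\mathbf{t}$ with diagonal covariance,

$$\prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1}\right) = \mathcal{N}\!\left(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w},\, \beta^{-1}\mathbf{I}\right),$$

and, viewed as a function of $\mathbf{w}$, its logarithm is a quadratic in $\mathbf{w}$ - which is what the conjugacy argument below actually relies on.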
Prior
We choose the corresponding conjugate prior, as we have a likelihood function which is the exponential of a quadratic function of $\mathbf{w}$.
For now, for this to make sense, let's say that the likelihood function - the product of all those Gaussians - is also Gaussian (as sketched in the note above).
Thus the prior p.d.f is a normal distribution,

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$
Posterior
The posterior p.d.f is a Gaussian as well (as we choose a conjugate prior),

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N).$$

After solving for $\mathbf{m}_N$ and $\mathbf{S}_N$ we get,

$$\mathbf{m}_N = \mathbf{S}_N \left(\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}\right), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}.$$
(The complete derivation is available in Bishop - (2.116)) - coming soon
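In the meantime, a rough sketch of where these expressions come from: collect the terms of $\ln p(\mathbf{w} \mid \mathbf{t}) = \ln p(\mathbf{t} \mid \mathbf{w}) + \ln p(\mathbf{w}) + \text{const}$ that depend on $\mathbf{w}$,

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\tfrac{1}{2}\, \mathbf{w}^{\mathrm{T}} \left(\mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\right) \mathbf{w} + \mathbf{w}^{\mathrm{T}} \left(\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}\right) + \text{const},$$

then complete the square: matching the quadratic and linear terms against those of $\ln \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ gives the expressions for $\mathbf{S}_N^{-1}$ and $\mathbf{m}_N$ above.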
The sizes are,
The mean vectors $\mathbf{m}_N$ and $\mathbf{m}_0$ are both $M \times 1$, and they can be thought of as the optimal parameter vector and pseudo-observations respectively.
The covariance matrices $\mathbf{S}_0$ and $\mathbf{S}_N$ are both $M \times M$.
We shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we assume a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$,

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}).$$

So we basically take $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$.
Thus if we use this prior we can simplify the mean vector and the covariance matrix of the posterior p.d.f to,

$$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}.$$
Now if we take the log of the posterior p.d.f $p(\mathbf{w} \mid \mathbf{t})$,

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 - \frac{\alpha}{2}\, \mathbf{w}^{\mathrm{T}} \mathbf{w} + \text{const},$$

then maximizing it with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to $\lambda = \alpha / \beta$.
Thus we conclude that while maximising the likelihood function is equivalent to minimizing the sum-of-squares error function, maximising the posterior p.d.f is equivalent to the regularization technique.
The regularization technique is used to control the over-fitting phenomenon by adding a penalty term to the error function in order to discourage the coefficients from reaching large values.
This penalty term arises naturally when we maximize the posterior p.d.f w.r.t $\mathbf{w}$, as shown below.
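Explicitly, writing the regularized error function (with regularization coefficient $\lambda$) as

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\, \mathbf{w}^{\mathrm{T}} \mathbf{w},$$

the log posterior above is just $-\beta \widetilde{E}(\mathbf{w})$ plus a constant when $\lambda = \alpha / \beta$, so minimizing $\widetilde{E}(\mathbf{w})$ is exactly maximizing $\ln p(\mathbf{w} \mid \mathbf{t})$.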
Here the minimization of the sum-of-squares error function $E_D(\mathbf{w})$ is also the same as maximization of the likelihood p.d.f. Taking the log of $p(\mathbf{t} \mid \mathbf{w}, \beta)$ we get,

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi) - \beta E_D(\mathbf{w}), \qquad E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n)\right\}^2,$$

thus maximizing the likelihood is equal to maximizing $-\beta E_D(\mathbf{w})$, i.e. minimizing $E_D(\mathbf{w})$ (the rest are all constants w.r.t $\mathbf{w}$).
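As a small numerical sanity check of these equivalences (a sketch on synthetic data, assuming a polynomial basis and arbitrary values of $\alpha$ and $\beta$; none of these choices come from the derivation above), the maximum-likelihood solution should coincide with ordinary least squares, and the posterior mean $\mathbf{m}_N$ should coincide with the regularized (ridge) least-squares solution with $\lambda = \alpha / \beta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): t = sin(2*pi*x) + Gaussian
# noise with precision beta, so the model described above applies.
N, M, alpha, beta = 25, 6, 2.0, 25.0
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)

# Design matrix, assuming polynomial basis functions phi_j(x) = x**j.
Phi = np.vstack([x**j for j in range(M)]).T          # shape (N, M)

# Maximum likelihood solution == ordinary least squares.
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Posterior with the isotropic prior: S_N^{-1} = alpha*I + beta*Phi^T Phi,
# and posterior mean (MAP estimate) m_N = beta * S_N * Phi^T t.
S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)

# Regularized least squares (ridge) with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print("MAP == ridge:", np.allclose(m_N, w_ridge))            # expect True
print("max |ML - ridge|:", np.max(np.abs(w_ml - w_ridge)))   # nonzero: ML is unregularized
```

The `allclose` check passing is exactly the statement above: with this prior, maximizing the posterior and minimizing the regularized sum-of-squares pick out the same $\mathbf{w}$, while the unregularized maximum-likelihood solution is generally different.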