Sunday, January 31, 2016

Polynomial curve fitting

As given in the book, we are trying to estimate the relationship between an independent variable $x$ and its dependent variable $t$ (both scalars) by modelling the relationship between them as an $M$th order polynomial. Then the model output $y$ can be written as a function of $x$ and $\mathbf{w}$,

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

Note that by storing the coefficients that uniquely define this “model” as a vector,

$$\mathbf{w} = (w_0, w_1, \dots, w_M)^T,$$

and by constructing a column vector (which is a function of $x$),

$$\boldsymbol{\phi}(x) = (1, x, x^2, \dots, x^M)^T,$$

we can rewrite the sigma more concisely as the dot product between the two vectors, as shown below:

$$y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$$
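As a quick sketch of this dot-product form in code (the polynomial order and coefficient values below are made up purely for illustration; NumPy assumed):

```python
import numpy as np

# Illustrative values, not from the text: a cubic (M = 3) with made-up coefficients.
M = 3
w = np.array([1.0, -2.0, 0.5, 0.1])   # w_0, w_1, ..., w_M

def phi(x, M):
    """phi(x) = (1, x, x^2, ..., x^M)^T."""
    return x ** np.arange(M + 1)

x = 2.0
y = w @ phi(x, M)                     # y(x, w) = w^T phi(x)

# Sanity check against the explicit sum w_0 + w_1 x + ... + w_M x^M.
assert np.isclose(y, sum(w[j] * x**j for j in range(M + 1)))
print(y)
```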

We were till now talking about a single $x$ and $t$ combination. What if we are given a Training Set consisting of $N$ observations of this system that we are trying to model? I.e., we get $\mathbf{x}$ and its corresponding $\mathbf{t}$,

where

$$\mathbf{x} = (x_1, x_2, \dots, x_N)^T, \qquad \mathbf{t} = (t_1, t_2, \dots, t_N)^T.$$

We can now choose an error function $E(\mathbf{w})$ which measures the misfit between the polynomial function $y(x, \mathbf{w})$ and the training set. The error function corresponds to (one half of) the sum of the squares of the displacements of each data point $t_n$ from the model estimate $y(x_n, \mathbf{w})$:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2$$
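Here is a minimal sketch of this error function in NumPy, on a toy training set (the data below is made up; I am sampling a noisy $\sin(2\pi x)$ just to have something to evaluate):

```python
import numpy as np

# Toy training set (illustrative only): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
N = 10
x_train = np.linspace(0.0, 1.0, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, size=N)

def E(w, x, t):
    """Sum-of-squares error: E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.sum(w * x[:, None] ** np.arange(len(w)), axis=1)  # y(x_n, w) for all n
    return 0.5 * np.sum((y - t) ** 2)

w = np.zeros(4)                  # an arbitrary M = 3 candidate
print(E(w, x_train, t_train))
```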

Now we see how taking the square of the error, while eliminating the distinction between positive and negative errors, also helps us when we minimize $E(\mathbf{w})$ with respect to $\mathbf{w}$: the derivative of $E(\mathbf{w})$ will be linear in $\mathbf{w}$, assuring us that there exists a unique solution for the minimization of $E(\mathbf{w})$.

This is one of the reasons why we prefer to square the difference rather than take the modulus of the difference (mean absolute error), in which case the derivative would have discontinuities at every point where an individual error changes sign.

So in order to find the best-fit polynomial (defined by $\mathbf{w}$) from the given training set, we can differentiate the error function with respect to $\mathbf{w}$ and equate the resulting linear equations in $\mathbf{w}$ to zero.

Now, the partial derivative of the error function with respect to $w_i$ is

$$\frac{\partial E(\mathbf{w})}{\partial w_i} = \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)\, x_n^i = \sum_{n=1}^{N} \Big( \sum_{j=0}^{M} w_j x_n^j - t_n \Big)\, x_n^i$$

The above equation is true for $i = 1, \dots, M$. When $i = 0$, the term $x_n^i = x_n^0$ will be one, not zero.
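To convince ourselves the derivative formula is right, here is a small finite-difference check (all the numbers below are made up for the example):

```python
import numpy as np

# Made-up data: M = 2 polynomial, three training points.
x = np.array([0.0, 0.5, 1.0])
t = np.array([0.1, 0.8, -0.2])
w = np.array([0.3, -0.1, 0.4])

A = x[:, None] ** np.arange(len(w))   # A[n, j] = x_n^j
y = A @ w                             # y(x_n, w) for all n

# i-th entry is sum_n (y(x_n, w) - t_n) * x_n^i -- the analytic derivative.
grad_analytic = A.T @ (y - t)

# Central finite differences on E(w) = 0.5 * sum((A w - t)^2).
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    dw = np.zeros_like(w)
    dw[i] = eps
    Ep = 0.5 * np.sum((A @ (w + dw) - t) ** 2)
    Em = 0.5 * np.sum((A @ (w - dw) - t) ** 2)
    grad_numeric[i] = (Ep - Em) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric))   # True
```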

In the partial differentiation above, note that when we differentiate with respect to $w_i$, all the other coefficients $w_j$ ($j \neq i$) are treated as constants. Let us now represent the equations $\partial E(\mathbf{w}) / \partial w_i = 0$ as a system of linear equations as shown below; then we can easily see how the Moore–Penrose pseudoinverse is the solution to this minimization problem.

Now we should note that we have the training set available to us, so $x_n$ and $t_n$ are known, while $\mathbf{w}$ is our unknown.
So if we construct a matrix $\mathbf{A}$ of size $N \times (M+1)$ as follows (each row is $\boldsymbol{\phi}(x_n)^T$),

$$\mathbf{A} = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix}$$

and we know $\mathbf{t} = (t_1, \dots, t_N)^T$ is a vector, then our minimization of the error function is equivalent to finding a $\mathbf{w}$ for the equation below such that we obtain a ‘best fit’ (least squares) solution:

$$\mathbf{A}\mathbf{w} = \mathbf{t}$$

Elaborating on why they are equivalent: the error function is a sum of squares, so the minimum possible value of the error function is attained when all the individual errors are 0. Here, every row of the column vector $\mathbf{A}\mathbf{w}$ is the model’s output $y(x_n, \mathbf{w})$ for the input $x_n$. So $\mathbf{A}\mathbf{w} - \mathbf{t}$ will consist of the error values from each $(x_n, t_n)$ pair given to us in the training data set.
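As a sketch of this construction in NumPy (the training inputs are made up; `np.vander` with `increasing=True` builds exactly the matrix above, one row $\boldsymbol{\phi}(x_n)^T$ per training point):

```python
import numpy as np

# Made-up training data: N = 5 points, fitting an M = 3 polynomial.
x_train = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t_train = np.array([0.2, 0.9, 0.1, -0.8, -0.1])
M = 3

A = np.vander(x_train, M + 1, increasing=True)   # A[n, j] = x_n^j

w = np.zeros(M + 1)                # any candidate w
residuals = A @ w - t_train        # the per-sample errors y(x_n, w) - t_n
print(0.5 * np.sum(residuals**2))  # exactly the error function E(w)
```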

I will write a post explaining how the Moore–Penrose pseudoinverse works. But for now there are many sources for that.

Therefore the best possible solution is

$$\mathbf{w}^{*} = \mathbf{A}^{+}\mathbf{t} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{t},$$

where $\mathbf{A}^{+}$ is the Moore–Penrose pseudoinverse of $\mathbf{A}$ (the second equality holds when $\mathbf{A}$ has full column rank).
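A minimal sketch of the whole fit in NumPy, assuming the same kind of toy $\sin(2\pi x)$ data as above (`np.linalg.pinv` computes the Moore–Penrose pseudoinverse, and `np.linalg.lstsq` solves the same least-squares problem directly, so the two should agree):

```python
import numpy as np

# Toy data again (illustrative only): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, size=10)
M = 3

A = np.vander(x_train, M + 1, increasing=True)

# w* = A^+ t; equals (A^T A)^{-1} A^T t when A has full column rank.
w_star = np.linalg.pinv(A) @ t_train

w_lstsq, *_ = np.linalg.lstsq(A, t_train, rcond=None)
print(np.allclose(w_star, w_lstsq))   # True
print(w_star)                         # the best-fit coefficients
```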

Written with StackEdit.
