Friday, April 22, 2016

Face Recognition

We have the data set - ORLFACEDATABASE.mat - consisting of 400 images (10 images each of 40 people). Each image is of size 48 x 48 (grey scale) and is reshaped into a 2304 x 1 column vector. The data matrix C stored in the MAT file is therefore of size 2304 x 400.

I will be using LDA (as a discriminative technique) followed by nearest mean to form a classifier.

  • First initialize all the necessary variables,
% Number of classes
k = 40;
% number of vectors in the ith class (for all i in 1,...,k)
ni = 10;
% number of features - pixels
n = 2304;
  • For each class (person) we can now compute the “mean vector” which is basically the mean image found from the 10 images.
% collection of class means (one row per person)
means = zeros(k,n);
for i = 1:1:40
    means(i,:) = ((1/10)*sum(C(:,(1+(10*(i-1))):i*10),2))';
end
  • Here in ‘means’ the ith row is the ith person’s mean image vector.
    We pass 2 as the dimension argument to sum so that it adds along each row (i.e. across the 10 columns belonging to that person); by default (dimension 1) it adds down each column.
    It’s better to have each person’s mean vector as a column vector - like how C is arranged - so we transpose (a quick visual check of a mean face is sketched after this):
means = means';
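A quick sanity check (a sketch, not part of the original post): reshape a mean vector back into a 48 x 48 image and display it. A transpose of the reshaped image may be needed, depending on how the images were originally vectorized.
% View person 1's mean face (transpose may be needed depending on vectorization order)
figure;
imagesc(reshape(means(:,1), 48, 48));
colormap gray; axis image off;
title('Mean face of person 1');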
  • Now we can also find the dataset mean - the average human image which I shall call meanStar

% Total mean (all data vectors)
meanStar = (1/400)*sum(C(:,:),2);
  • As discussed for assignment 2, I will employ multi-class LDA and first find the between-class scatter matrix $S_b$.
% Construct Sb - between class Covariance Matrix
Sb = zeros(n,n);
for i = 1:1:k
    Sb = Sb + ni*(means(:,i) - meanStar)*(means(:,i) - meanStar)';
end
  • Similarly we can construct the within-class scatter matrix $S_w$,
% Construct Sw - within class Covariance matrix
Sw = zeros(n,n);
for i = 1:1:k
    % Takes the values of the within class C.M for each of the classes
    Stemp = zeros(n,n);
    % The 10 vectors of the ith class arranged as columns
    classVectors = C(:,(1+(10*(i-1))):i*10);

    for j = 1:1:ni
        Stemp = Stemp + (double(classVectors(:,j)) - means(:,i))*(double(classVectors(:,j)) - means(:,i))';
    end
    Sw = Sw + Stemp;
end

In practice, $S_w$ is often singular since the data are image vectors with large
dimensionality while the size of the data set is much smaller. Running cond on the obtained $S_w$ shows us how ill-conditioned the matrix is.

To alleviate this problem, we can perform two projections:
1) PCA is first applied to the data set to reduce its dimensionality.
2) LDA is then applied to further reduce the dimensionality.

But for now I am borrowing an idea from discrete inverse theory: adding a small value c to the diagonal of a matrix A that is about to be inverted is called damping the inversion, and the small value c is called the Marquardt-Levenberg coefficient. Sometimes A has zero or near-zero eigenvalues, which makes it singular; adding a small damping coefficient to the diagonal elements makes the inversion stable. The bigger the value of c, the bigger the damping, and the more stable the matrix inversion,
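
Before applying it to $S_w$, here is a tiny toy illustration (with made-up numbers, not the actual scatter matrix) of how damping improves the condition number:
% Toy example of damping: A is nearly singular, so cond(A) is huge
A = [1 1; 1 1+1e-12];
cond(A)                 % on the order of 10^12
c = 1e-3;               % Marquardt-Levenberg (damping) coefficient
cond(A + c*eye(2))      % roughly 2*10^3 - far more stable to invert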

SwDamped = Sw/(10^6) + 2*eye(n,n);  % damped (and rescaled) within-class scatter
cond(Sw)                            % shows how ill-conditioned the original Sw is
[V,D] = eig(inv(SwDamped)*Sb);      % LDA directions: eigenvectors of inv(Sw)*Sb
  • Now that we have got both $S_b$ and $S_w$, we can find the optimal direction $w$ (an eigenvector of $S_w^{-1} S_b$) and use it to project the data to a lower dimension.
% the first eigen vector
w = V(:,1);

% the projected points
y = ones((k*ni),1);

for i = 1:1:(k*ni)
    y(i) = real(w)'*double(C(:,i));
end
  • I just plotted all the points below,

Plot of the 400 projected points

Well, I was unable to find the code to plot with 40 different colors, so for now all the points look the same, but each point represents an image (a sketch for colouring the points, and for the nearest-mean step, is given below). Now using this projection we can define boundaries for each person and use them to decide which person a new image belongs to.
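
Here is a sketch of both ideas, using the variables defined above: colouring the projected points by person (the hsv colormap is an arbitrary choice), followed by a minimal nearest-mean classification step. The new image xNew is a hypothetical stand-in (I just reuse a training column); it is not part of the original code.
% Colour the 1-D projected points by person (40 distinct colours)
colors = hsv(k);
figure; hold on;
for i = 1:1:k
    idx = (1 + ni*(i-1)) : (i*ni);           % columns of C belonging to person i
    scatter(y(idx), zeros(ni,1), 36, colors(i,:), 'filled');
end

% Nearest-mean classification on the projected line
projMeans = real(w)' * means;                % 1 x 40: projected class means
xNew = double(C(:,1));                       % hypothetical "new" image (stand-in)
yNew = real(w)' * xNew;                      % its projection
[~, predictedPerson] = min(abs(projMeans - yNew))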

Written with StackEdit.

Tuesday, April 19, 2016

Dimensionality Reduction

Dimensionality reduction is a method of reducing the number of random variables under consideration by obtaining a set of “uncorrelated” principal variables.

If we have $p$ features (each of them being one of the random variables we just talked about), then each data sample will be a $p$-dimensional vector in the feature space. When $p$ is a large number, patterns in the data can be hard to find, as graphical representation is not possible.

What if we are able to identify a smaller set of “principal variables” - transformations of the existing variables - such that we don’t lose much information?

PCA is a famous example where such data transformations are linear; there are other non-linear methods too.

Here I’ll now be talking about PCA (Principal Component Analysis) and K-LDA (the kernelized version of linear discriminant analysis (LDA)), and using these techniques to reduce the 2-dimensional data in this dataset to a single dimension.

Principal Component Analysis

PCA is the orthogonal projection of the data onto a lower dimensional linear space, such that the variance of the projected data is maximized.

Variance is the spread of the data; the more spread out the data is, the better for us, as we can more easily separate it into clusters etc. We lack information if all the data vectors are localized around the mean.

So to perform PCA, we calculate the directions (in the $p$-dimensional feature space) in which the variance is maximum (say $k$ such directions). Using these directions as basis vectors we can now project all the data vectors onto this subspace.

Here $k = 1$, and we choose the directions by calculating the eigenvectors of the covariance matrix $C$: the variance of the data projected onto a unit vector $u$ is $u^T C u$, which is maximized when $u$ is the eigenvector of $C$ with the largest eigenvalue. The eigenvectors corresponding to larger eigenvalues are therefore the directions along which the data varies more.

Now I will implement the above method in MATLAB. As part of the assignment (since I am in the ICE department) I am assigned the 6th dataset - TRAIN{1,6} - which, as you can see, has 4 clusters:

  • So we can load the data set and initialize the clusters, after which we can plot them as a scatter graph.
load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan


scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;

DataSet initial

  • So we have data comprising a set of $n$ observations of $p$ variables. This data can be arranged as a set of $n$ data vectors $x_1, \dots, x_n$, with each $x_i$ representing a single grouped observation of the $p$ variables.
    Here the 4 clusters with 13 data vectors each contribute to give $n = 52$ (and $p = 2$). Write each $x_i$ as a row vector with $p$ columns, and place these row vectors into a single matrix X of dimensions $n \times p$ (52 x 2).
  • Now we center our data at the origin. This is supposed to help in reducing the Mean Square Error when we approximate the data. This answer also provides some insight into why this matters when computing the covariance matrix. Note that after centering at the origin, if you execute sum(B) you will get very small numbers close to 0, so we can conclude the data has been successfully centered.
    But the plots before and after centering don’t look that different for this dataset, as the empirical mean is small and the entire data set just shifts as a whole fractionally.
% number of data vectors - 13*4
n = 52;
% number of dimensions (before reduction)
p = 2; 
% arrange data vectors as rows
X = [cluster1';cluster2'; cluster3'; cluster4'];
% find both x and y's mean - empirical mean vector
u = (1/n)*sum(X);
% subtract from the mean - Thus achieving centering
B = X - ones(52,1)*u;


% Plot the data after centering it around the Origin
figure(2);
scatter(B(1:13,1), B(1:13,2), 'g'); 
hold on;
scatter(B(14:26,1), B(14:26,2), 'b');
hold on;
scatter(B(27:39,1), B(27:39,2), 'r');
hold on;
scatter(B(40:52,1), B(40:52,2), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

dataset after centering

  • Next I will find the empirical covariance matrix from the outer product of matrix B with itself (a quick sanity check against MATLAB’s built-in cov is sketched below).
C = (B'*B)/(n-1);
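As a small sanity check (a sketch, not in the original post), this should coincide with MATLAB's built-in cov, which also subtracts the mean and uses the 1/(n-1) normalization:
% C should match cov(X) up to floating point error
max(max(abs(C - cov(X))))   % expect something around 1e-16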
  • Then we can find the eigenvectors of the covariance matrix and choose the one with the larger eigenvalue. Now we can project all the 2D points onto the line in the direction of that particular eigenvector.
% svds returns the singular values in descending order; for a symmetric
% positive semi-definite matrix like C these equal the eigenvalues
E = svds(C);
% V holds the eigenvectors, D the eigenvalues. For a symmetric matrix, eig
% returns the eigenvalues in ascending order, so the last column of V
% corresponds to the largest eigenvalue.
[V,D] = eig(C);

% Thus the axis along which variance is Maximum is
maxVarianceAxis = V(:,2);

% No need to divide by (maxVarianceAxis'*maxVarianceAxis) as
% eig generates orthonormal eigenvectors
projectionMatrix = (maxVarianceAxis*maxVarianceAxis');

projCluster1 = projectionMatrix*cluster1;
projCluster2 = projectionMatrix*cluster2;
projCluster3 = projectionMatrix*cluster3;
projCluster4 = projectionMatrix*cluster4;

% Plot the data after projecting onto the principal component
figure(3);
scatter(projCluster1(1,:), projCluster1(2,:), 'g'); 
hold on;
scatter(projCluster2(1,:), projCluster2(2,:), 'b');
hold on;
scatter(projCluster3(1,:), projCluster3(2,:), 'r');
hold on;
scatter(projCluster4(1,:), projCluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

After projecting onto the principal component axis

As we can see, the separation of the blue and green clusters is good, but the cyan cluster is getting mixed up with the red cluster.

Before I wrap up, I’d like to show what the projection of these data vectors onto the other eigenvector looks like,

Projection onto the other eigenvector

See how the red and cyan clusters are well separated here, though the green and the dark blue are not separated as well as in the projection onto the eigenvector corresponding to the largest eigenvalue. The variance along this direction is only slightly smaller (because the eigenvalues are close - about 0.2 apart).

So here we had 2 features which we reduced to a single feature (the position of the data vector on the line - a single value if expressed as the signed distance from the origin). Thus we successfully implemented PCA and extracted an abstract feature along which the data has maximum variance globally, thereby reducing dimensionality.
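
For completeness, that single reduced feature (the signed coordinate of each centred data vector along the principal axis) can be computed directly; a one-line sketch using the variables defined above:
% 52 x 1 vector: one scalar feature per (centred) data vector
t = B * maxVarianceAxis;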

This leads us to LDA, where we no longer look at this same dataset as just a collection of data vectors; rather, we take into consideration that they are vectors from different classes.

Linear Discriminant Analysis

This is a method to find a linear combination of features that characterizes or separates two or more classes of objects.

This method is also known as Fisher’s linear discriminant. The idea is that we don’t wish to project such that the overall variance is maximized; rather, we want the clusters to be spread out (maximize the between-class covariance after projection) while the data vectors of a particular class stay close to the mean of that class (the within-class covariance summed over all the classes should be minimized).

The idea is that this enables us to easily discriminate new data vectors into classes. The more clearly separated the classes are, the less ambiguity we face when classifying.

I’ll explain using 2 classes and then extend it to multiple classes (in this dataset we are using 4 classes). Let there be 2 classes $C_1$ and $C_2$, having $n_1$ and $n_2$ vectors respectively. $x_j$ denotes a data vector. So the mean of class $i$ will be,

$$m_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j$$

The projection from $p$ dimensions down to 1 dimension is done using $y = w^T x$,

where the sizes are $w: p \times 1$ and $x: p \times 1$, so the projection $y$ is a scalar.

The simplest measure of the separation of the classes, when projected along the line defined by $w$, is the square of the separation between the projected class means. If the means after projection of classes 1 and 2 are $m_1' = w^T m_1$ and $m_2' = w^T m_2$, then

$$(m_2' - m_1')^2 = w^T (m_2 - m_1)(m_2 - m_1)^T w = w^T S_b\, w$$

Here we define the between-class covariance matrix as,

$$S_b = (m_2 - m_1)(m_2 - m_1)^T$$

Thus $w^T S_b\, w$ is to be maximized.

While we minimize the sum of the within-class variances of the projected data over the classes, $s_1^2 + s_2^2$.

The within-class variance of the transformed data for class $i$ is given by,

$$s_i^2 = \sum_{x_j \in C_i} (w^T x_j - m_i')^2$$

So after taking $w$ out of the summation and rewriting it in terms of the total within-class covariance matrix $S_w = \sum_{i=1}^{2} \sum_{x_j \in C_i} (x_j - m_i)(x_j - m_i)^T$,

for 2 classes the measure of how close the data vectors are to their respective class means is

$$s_1^2 + s_2^2 = w^T S_w\, w$$

Thus $w^T S_w\, w$ is to be minimized.

So with this we can construct the utility function which we need to maximize,

$$J(w) = \frac{w^T S_b\, w}{w^T S_w\, w}$$

Differentiating $J(w)$ with respect to $w$, setting the derivative equal to zero, and rearranging gives,

$$w \propto S_w^{-1}(m_2 - m_1)$$

Multi Class LDA

Thus we have found the direction of $w$ for which our conditions are optimally achieved.

But extending this idea to multiple classes requires us to make some assumptions - so as to make the mathematics a little easier.

Suppose that now there are $k$ classes. The within-class covariance matrix is calculated as,

$$S_w = \sum_{i=1}^{k} \sum_{x_j \in C_i} (x_j - m_i)(x_j - m_i)^T$$

While the between-class covariance matrix is given by,

$$S_b = \sum_{i=1}^{k} n_i\, (m_i - m_*)(m_i - m_*)^T$$

In the above 2 equations $m_*$ is the mean of all the data vectors and $m_i$ is the mean of the data vectors in the $i$th class.

Using the revised covariance matrices and plugging them into the utility function, we find that the optimal solution for $w$ satisfies

$$S_b\, w = \lambda\, S_w\, w$$

Optimal $w$ is thus an eigenvector of $S_w^{-1} S_b$ (the one with the largest eigenvalue).

Kernel Trick + LDA = K-LDA

K-LDA, the kernelized version of linear discriminant analysis (LDA) (also known as kernel Fisher discriminant analysis or kernel discriminant analysis), is LDA implicitly performed in a new feature space, which allows non-linear mappings to be learned.

For most real-world data a linear discriminant (offered by LDA) is not complex enough, because the data vectors might not be linearly separable at all.

We first map the data non-linearly into some feature space (a higher dimensional space) and compute LDA there (for different kernel functions these mappings differ). The linear separation which LDA provides in the feature space yields a non-linear discriminant in the original input space. You can think of 2 clusters (in the 2D plane) with one cluster embedded inside the other: if we map these vectors to 3 dimensions, we can introduce a dummy variable (the third axis) such that the 2 clusters are easily separated linearly (by a plane); a sketch of such a mapping follows the figure below. As you can see below, the left side image is the input space and the right side image is the feature space.

why use feature space
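
Here is a small toy sketch of that idea (with made-up ring-shaped data, not this assignment's dataset): two concentric clusters in 2D become linearly separable after an explicit map to 3D that adds the squared radius as a third coordinate.
% Two concentric rings in 2D (hypothetical data)
theta = linspace(0, 2*pi, 100)';
innerRing = 1.0*[cos(theta), sin(theta)] + 0.05*randn(100,2);
outerRing = 3.0*[cos(theta), sin(theta)] + 0.05*randn(100,2);

% Explicit feature map (x1, x2) -> (x1, x2, x1^2 + x2^2)
phi = @(X) [X, sum(X.^2, 2)];
Fi = phi(innerRing); Fo = phi(outerRing);

figure;
scatter3(Fi(:,1), Fi(:,2), Fi(:,3), 'b'); hold on;
scatter3(Fo(:,1), Fo(:,2), Fo(:,3), 'r');
% In the feature space the rings sit near z = 1 and z = 9,
% so a horizontal plane separates them linearly.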

So as we are operating in the feature space, the only change in the LDA explained above is that we use $\phi(x)$ instead of $x$ everywhere. Explicitly computing the mappings $\phi(x_i)$ and then performing LDA can be computationally expensive, and in many cases intractable. For example, $\phi(x)$ may be infinite dimensional (as in the case of the Gaussian kernel).

So instead of designing a nonlinear transformation $\phi$ and calculating the transformed data vectors $\phi(x_i)$ in the feature space, we can rewrite the method in terms of dot products only and use the kernel trick, in which the dot product in the new feature space is replaced by a kernel function $k(x, x') = \phi(x)^T \phi(x')$.

I used this paper and also referred to Wikipedia - we can rewrite the equation for $w$ in terms of only dot products,

In the following I am assuming these sizes apply: there are $p$ features, $n_T$ data vectors in total, and the projection is onto 1 dimension. The remaining sizes can be understood implicitly. The data vectors are separated into $c$ classes, with the $i$th class $C_i$ having $n_i$ data vectors.

First of all, $w$ should lie in the span of all the training samples in the feature space (read up on the theory of reproducing kernels to understand why),

$$w = \sum_{j=1}^{n_T} \alpha_j\, \phi(x_j)$$

where $n_T$ is the total number of vectors.

Now the dot product of the mean of the $i$th class (in the feature space) with $w$ gives,

$$w^T m_i^{\phi} = \frac{1}{n_i}\sum_{j=1}^{n_T}\sum_{x_l \in C_i} \alpha_j\, k(x_j, x_l)$$

Here $\alpha$ is the set of all $\alpha_j$ piled together to form a vector. The mean of each class in the feature space, $m_i^{\phi}$, is defined just like how we had discussed in the LDA section.

Using the equation for $w$ and $m_i^{\phi}$ we can say,

$$w^T m_i^{\phi} = \alpha^T M_i$$

where $M_i$ is a vector whose $j$th element is defined as,

$$(M_i)_j = \frac{1}{n_i}\sum_{x_l \in C_i} k(x_j, x_l)$$

Here $x_j$ is the $j$th data vector among ALL the vectors, while $x_l$ runs over the vectors in the $i$th class.

If we define a matrix M as below,

$$M = \sum_{i=1}^{c} n_i\, (M_i - M_*)(M_i - M_*)^T$$

where the $j$th element of the vector $M_*$ is defined as,

$$(M_*)_j = \frac{1}{n_T}\sum_{l=1}^{n_T} k(x_j, x_l)$$

Then the numerator of the utility function can be written as,

$$w^T S_b^{\phi}\, w = \alpha^T M\, \alpha$$

Here $S_b^{\phi}$ is just the between-class covariance matrix defined for the transformed data vectors (in the feature space).

Moving on to the denominator, if we define a matrix $N$ as

$$N = \sum_{i=1}^{c} K_i\, (I - \mathbf{1}_{n_i})\, K_i^T$$

with the $(j, l)$ component of $K_i$ defined as $(K_i)_{jl} = k(x_j, x_l^{(i)})$, where $x_l^{(i)}$ is the $l$th vector in the $i$th class. Also, here $\mathbf{1}_{n_i}$ is the $n_i \times n_i$ matrix with all entries equal to $1/n_i$.

Here $K_i$ (of size $n_T \times n_i$) is known as the kernel matrix for class $i$.

Then the denominator of the utility function can be written as,

$$w^T S_w^{\phi}\, w = \alpha^T N\, \alpha$$

Thus, using both of the highlighted equations, the utility function in the feature space is,

$$J(\alpha) = \frac{\alpha^T M\, \alpha}{\alpha^T N\, \alpha}$$

This problem can be solved (just like how LDA was solved in the input space) and the optimal value for $\alpha$ is the leading eigenvector of $N^{-1} M$. The projection of a data vector $x$ onto the discriminant is then $y(x) = w^T \phi(x) = \sum_{j=1}^{n_T} \alpha_j\, k(x_j, x)$.

MATLAB Implementation

The above equations can be implemented in MATLAB as follows. Note there are 2 features and we wish to reduce them to a single dimension. There are 4 classes ($c = 4$), each class having 13 vectors ($n_i = 13$), and so the total number of vectors is $n_T = 52$.

load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan

nT = 52; % Total number of data Vectors
n1 = 13; % # vectors in cluster 1
n2 = 13; % # vectors in cluster 2
n3 = 13; % # vectors in cluster 3
n4 = 13; % # vectors in cluster 4

% Plot the data before dimensionality reduction
figure(1);
scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');
  • Now I’ll construct the Gram matrix for all the data vectors (a vectorized alternative is sketched after the loop),
% CONSTRUCT KERNEL MATRIX
gamma = 10;

% Arrange all the clusters together
X = [cluster1';cluster2'; cluster3'; cluster4'];
K = ones(nT,nT);

% Construct the Gram Matrix 
% K(i,:) - gives the ith vector out of ALL vectors
for i = 1:1:nT
    for j = 1:1:nT
        K(i,j) = kernelGauss(X(i,:),X(j,:),gamma);
    end
end
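As an aside, the double loop can be vectorized. This sketch assumes kernelGauss computes exp(-gamma*||x1-x2||^2) (the function itself is not shown in the post) and that pdist2 from the Statistics and Machine Learning Toolbox is available:
% Vectorized Gram matrix (same result as the loop, under the stated assumption)
D = pdist2(X, X);           % 52 x 52 pairwise Euclidean distances
K = exp(-gamma * D.^2);     % Gaussian (RBF) kernel matrix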
  • Following the equations given above we can construct M,
% Okay, K(i,j) = k(x_i, x_j). Since X(1:13,:), X(14:26,:), X(27:39,:) and
% X(40:52,:) are the 4 clusters respectively, we can calculate each M_i below.
% (The index ranges follow from how X was constructed.)

M_1 = ones(nT,1);
M_2 = ones(nT,1);
M_3 = ones(nT,1);
M_4 = ones(nT,1);
M_star = ones(nT,1);

% sum(K(1:13,j)) is the sum of k(x,x_j) over all x belonging to cluster 1
% sum(K(14:26,j)) is the sum of k(x,x_j) over all x belonging to cluster 2
% sum(K(27:39,j)) is the sum of k(x,x_j) over all x belonging to cluster 3
% sum(K(40:52,j)) is the sum of k(x,x_j) over all x belonging to cluster 4
% sum(K(:,j)) is the sum of k(x,x_j) over ALL x (all clusters)

for j = 1:1:nT
    M_1(j) = (1/nT)*sum(K(1:13,j));
end
for j = 1:1:nT
    M_2(j) = (1/nT)*sum(K(14:26,j));
end
for j = 1:1:nT
    M_3(j) = (1/nT)*sum(K(27:39,j));
end
for j = 1:1:nT
    M_4(j) = (1/nT)*sum(K(40:52,j));
end

for j = 1:1:nT
    M_star(j) = (1/nT)*sum(K(:,j));
end

% Thus we can construct M
M = n1*(M_1-M_star)*(M_1-M_star)' + n2*(M_2-M_star)*(M_2-M_star)' + n3*(M_3-M_star)*(M_3-M_star)' + n4*(M_4-M_star)*(M_4-M_star)';
  • For constructing N, we will first separate the gram matrix we had made into the kernel matrices for each of the classes.
% Now we shall construct the kernel matrices for each of the clusters
K_1 = K(:,1:13);
K_2 = K(:,14:26);
K_3 = K(:,27:39);
K_4 = K(:,40:52);

% Thus we can construct N
N = K_1*(eye(n1,n1)-(1/n1)*ones(n1,n1))*K_1' + K_2*(eye(n2,n2)-(1/n2)*ones(n2,n2))*K_2' + K_3*(eye(n3,n3)-(1/n3)*ones(n3,n3))*K_3' + K_4*(eye(n4,n4)-(1/n4)*ones(n4,n4))*K_4';
  • Now compute $N^{-1}M$ and its eigenvalues and eigenvectors (see the note after the code). Note that as N is usually singular, we add a multiple of the identity to it so that inv(N) becomes computable.
% Note that in practice, N is usually singular and
% so a multiple of the identity is added to it
N = N + 2*eye(nT,nT);
cond(N) % is N ill conditioned? 

% Now that we have both N and M, we can find all the eigenvectors
% and choose most prominent eigenvectors
P = inv(N)*M;
[V,D] = eig(P);
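One caution (a small sketch, not in the original code): for a general non-symmetric matrix, eig does not guarantee any particular ordering of the eigenvalues, so it is safer to sort them explicitly before picking the leading eigenvector:
% Sort eigenvalues in descending order and pick the corresponding eigenvector
[~, order] = sort(real(diag(D)), 'descend');
leadingEigVec = real(V(:, order(1)));   % here this should coincide with V(:,1)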
  • Now we can project the data onto a line using the leading eigenvector, which has an eigenvalue of 3.07 (the projection of the $i$th data vector is $\alpha^T K(:, i)$).
% the first eigenvector has the largest corresponding eigenvalue - 3.07
% after which 0.4, 0.28, 0.26 Then it drops down...
alpha = V(:,1);

% the projected points
y = ones(nT,1);

for i = 1:1:nT
    y(i) = alpha'*K(:,i);
end

projCluster1 = y(1:13);
projCluster2 = y(14:26);
projCluster3 = y(27:39);
projCluster4 = y(40:52);

% Plot the projected data
figure(3);
scatter(projCluster1, zeros(13,1), 'g'); 
hold on;
scatter(projCluster2, zeros(13,1), 'b');
hold on;
scatter(projCluster3, zeros(13,1), 'r');
hold on;
scatter(projCluster4, zeros(13,1), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

The plot we get for gamma (of the kernel function) being 10 is,

After projection

If we try different values of gamma for the kernel function, we can likely get even better separation of the classes.

So in conclusion, both methods can be viewed as techniques for linear dimensionality reduction (K-LDA being linear in the feature space, hence non-linear in the input space). However, PCA is unsupervised and depends only on the data vectors (it maximizes their variance using fewer features), whereas the Fisher linear discriminant (and K-LDA) also uses class-label information to bring data vectors closer to the other data vectors of the same class while maximizing the separation between classes.

Written with StackEdit.

Thursday, April 14, 2016

Equivalent Kernel

The prediction of the output for an input $x$ is,

$$y(x) = \sum_{n=1}^{N} k(x, x_n)\, t_n$$

here the function $k(x, x_n)$ is called the smoother matrix or the equivalent kernel.

Note that we are substituting for $w$ the solution we obtain when we maximize the posterior p.d.f.

This effective kernel defines the weights by which the training
set target values are linearly combined in order to make a prediction for a new value of x.

Here the scalar $k(x, x_n)$ can be thought of as a measure of how important the $n$th input-output vector pair (the $n$th mapping $x_n \to t_n$ in the training dataset) is for the prediction of the output corresponding to the input $x$.

For example, with the Gaussian kernel, the closer $x_n$ is to $x$, the larger the magnitude of the kernel function. Thus the corresponding $t_n$ will be weighted higher.
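
A tiny numeric illustration (made-up points, and assuming the Gaussian kernel has the form k(x, x_n) = exp(-gamma*||x - x_n||^2), which is an assumption, not the post's definition):
x       = [0.40, 0.50];
xn_near = [0.41, 0.49];          % close to x
xn_far  = [0.90, 0.10];          % far from x
gamma   = 10;
w_near  = exp(-gamma * norm(x - xn_near)^2)   % ~0.998 -> t_near weighted heavily
w_far   = exp(-gamma * norm(x - xn_far)^2)    % ~0.017 -> t_far barely contributes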

This is shown visually below, when we work with an actual dataset.

You can read more about this in the wiki page for Kernel Smoother.

Kernel Matrix

A kernel matrix $K$ of size $M \times N$ operates on the training set and produces predictions. $M$ is the number of input vectors for which we want predictions, and there are $N$ data samples $\{x_i, t_i\}$ in the training dataset.

Here the predictions for the $M$ input vectors will be a vector $y = K\,t$, where $t$ collects the training targets and each row of weights is normalized to sum to 1.

If we wish to form a “model” of sorts (which gives us predictions) instead of laboriously computing the kernels every time we wish to make a prediction, we can precompute the kernel matrix for a grid of inputs, and then for any new input we simply find the closest grid input. In order to save memory space and still ensure accuracy, we can set a resolution and operate on normalized data.

So if we consider a resolution of 0.01 and the normalization range to be [0, 1], then the number of input vectors $M$ will be $101 \times 101 = 10201$ if we consider the input to be 2 dimensional.

It may take a long time to compute the entire matrix (and to experiment with various kernel parameter values to finally obtain the kernel matrix), but once we have it we can predict for new inputs quickly.

Now I’ll use the above concepts to form such a model using this dataset. Find the whole code here; please ensure you have the kernel functions defined in separate files and added to MATLAB’s path (possible forms of these kernel functions are sketched below).

% The exponential kernel is closely related to the Gaussian kernel, with only the 
% square of the norm left out. It is also a radial basis function kernel.
function [res] = kernelExp(X1, X2, c)
    res = exp( -(norm(X1-X2)) * c );
end
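
The other two kernel functions used below, kernelGauss and kernelHyperTangent, are not shown in the post; here are plausible sketches of what they might look like. The exact parameterizations (where gamma, alpha and c enter) are my assumptions, not the post's definitions.
% Gaussian (RBF) kernel - assumed form exp(-gamma * ||X1 - X2||^2)
function [res] = kernelGauss(X1, X2, gamma)
    res = exp( -gamma * (norm(X1-X2)^2) );
end

% Hyperbolic tangent (sigmoid) kernel - assumed form tanh(alpha * <X1, X2> + c)
function [res] = kernelHyperTangent(X1, X2, alpha, c)
    res = tanh( alpha * (X1*X2') + c );   % X1, X2 are 1*2 row vectors
end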
  • First load the data into the workspace,

load('ASSIGNMENT1.mat');

  • Separate DATA into the training dataset and testing dataset.
% Training DataSet - 7,500 data samples
inputTrain = DATA(1:7500,1:2); % each row is a 1*2 input vector
outputTrain = DATA(1:7500,3:4); % each row is a 1*2 target vector

% Testing DataSet
inputTest = DATA(7501:10000,1:2);
outputTest = DATA(7501:10000,3:4);
  • Now, depending on how the input is normalized (in our case it’s 0 to 1), initialize the parameters. The finer the resolution, the longer forming the model will take, but the model will then predict a larger number of inputs with greater accuracy (we won’t have to round off as far to the closest grid input). I am preallocating the modelInputs matrix (since an array growing by assignment or concatenation hurts code performance).
r = 0.1; % Resolution
numberOfDecimals = 1; % test inputs are rounded to this many decimals (matches r = 0.1)
% The number of grid input vectors (the input is 1*2, normalized to [0,1])
M = ((1/r)+1)*((1/r)+1); % 11*11 = 121 for r = 0.1
N = 7500; % 75% of the total 10,000 samples are used as the training set.
modelInputs = zeros(M,2);
NormalisedInputs = 0:r:1;
  • Construct the set of input vectors using NormalisedInputs, by taking all pairwise combinations of it (as we need a 1*2 vector); a small helper for mapping a linear index back to its grid input is sketched after the loop.
for i = 1:1:length(NormalisedInputs)
    for j = 1:1:length(NormalisedInputs)
        %modelInputs = [modelInputs; [NormalisedInputs(i),NormalisedInputs(j)]];
        modelInputs(j+length(NormalisedInputs)*(i-1),1:2) = [NormalisedInputs(i),NormalisedInputs(j)];
    end
end
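With this construction, the grid input at linear index idx is [NormalisedInputs(i), NormalisedInputs(j)] with idx = j + 11*(i-1). Here is a small helper (a sketch, assuming r = 0.1) that maps an index back to its input, which is handy when reading the peak indices discussed later:
nGrid = length(NormalisedInputs);   % 11 for r = 0.1
idxToInput = @(idx) [NormalisedInputs(ceil(idx/nGrid)), ...
                     NormalisedInputs(mod(idx-1, nGrid) + 1)];
idxToInput(50)                      % returns [0.4, 0.5]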
  • For a particular value of gamma we can now construct the K matrix,
K = zeros(M,N);
gamma = 0.1;

for i = 1:1:M
    for j = 1:1:N
        K(i,j) = kernelGauss(modelInputs(i,1:2),inputTrain(j,1:2),gamma);
    end
end

We can do the same with more than one kernel (here the Gaussian and the hyperbolic tangent/sigmoid kernels). They are the following,

K_gauss = zeros(M,N);
K_sigmoid = zeros(M,N);
gamma = 10;
alpha = 1;
c = 1;

for i = 1:1:M
    for j = 1:1:N
        K_gauss(i,j) = kernelGauss(modelInputs(i,1:2),inputTrain(j,1:2),gamma);
        K_sigmoid(i,j) = kernelHyperTangent(modelInputs(i,1:2),inputTrain(j,1:2),alpha,c);
    end
end
  • At this point I would like to show visually how the weighting of the training set works. As you can see, each row of the K matrix corresponds to a single grid input, while the columns correspond to the training samples. So if you run the following code,

Note that I tried for about 2 days to find the MATLAB syntax to plot the X axis as the first input parameter and the Y axis as the second input parameter. So when we take, say, a resolution of 0.1, that would be the scale of the X and Y axes. Using the Z axis we could then represent the weight we attribute to that input. It would be a scatter plot of (x, y, z), where (x, y) is the input and z is its corresponding weight. In this way it would be clearly seen, as a hill, how the inputs similar to the training input under consideration get high weightage.
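
For what it's worth, here is one way that 3D view could be produced (a sketch using the variables already defined; P = 98 is the training-sample index used below):
P = 98;                                % index of the training sample to inspect
figure;
scatter3(modelInputs(1:M,1), modelInputs(1:M,2), K_gauss(1:M,P), ...
         36, K_gauss(1:M,P), 'filled');
xlabel('first input parameter'); ylabel('second input parameter');
zlabel('weight k(x, x_P)');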

The following code displays the weights corresponding to all 121 input vectors from (0,0) to (1,1) (when we take resolution = 0.1) for the Pth training sample. Here M is 121 because that many input vectors exist from (0,0) to (1,1) with a resolution of 0.1.

plot(K_gauss(1:M,P));

When P = 98 (the index of the training sample is 98) with training input - (0.396, 0.495), we get the following plot.

weight distribution

See how we have 5 significant peaks in the weights? If we had plotted it in 3D, we could have seen that all these peaks are actually close together.
The largest peak is at index 50, which happens to be the input vector (0.4, 0.5). Note how similar it is to the training input we are comparing with.
The next 2 peaks are at indices 39 and 61, which are (0.3, 0.5) and (0.5, 0.5). So we see that a change of one resolution step also counts as close.
The final 2 are at 28 and 72, which are (0.2, 0.5) and (0.6, 0.5).

So this shows us how $k(x, x_n)$ is larger the closer $x$ is to $x_n$.

Now the question arises: how much larger should the weights be if the points are close, or rather, how should the weights vary as the distance between them changes? This is controlled using the gamma parameter; I used a value of 50 and obtained the above graph. Note that a larger gamma means more emphasis on closeness. So for a sufficiently large gamma, you will get a single peak at the 50th index.

impulse lol

This should give you a good idea of how the Gaussian kernel works as an instance-based learner.

Ok, so for a particular training sample (i.e. P = 98) we plotted the weight values (corresponding to each of the 121 input vectors) and showed how the input vectors which are “closer” to that particular training sample’s input are given a larger weight.

Cool, so to predict the output for a particular input vector, we use all the weights and weigh ALL the training samples’ outputs.

But first we need to normalize the weights so that they sum to 1, as these weights only encode the relative importance of the various training outputs.

% This vector's ith element is the sum of weights used to predict the 
% output for the ith input vector.
sumOfWeights = zeros([M,1]);
for i = 1:1:M
    sumOfWeights(i) = sum(K_gauss(i,1:N));
end

predictions = K_gauss * outputTrain;
% Normalize it
predictions(1:M,1) = predictions(1:M,1)./ sumOfWeights;
predictions(1:M,2) = predictions(1:M,2)./ sumOfWeights;

Moving on to testing this model, we can calculate the MSE when we compare the results of the prediction to the actual outputs.

% Now we are going to test this model - calculate MSE error
L = 10000 - N; % remaining samples in the data set are used to test
mse = 0;

for i = 1:1:L
    % in this iteration we wish to find the output for inputTest(i,1:2)
    modelInput = inputTest(i,1:2);

    % First we need to find which index this input is closest to
    % in modelInputs
    X1 = round(modelInput(1), numberOfDecimals);
    X2 = round(modelInput(2), numberOfDecimals);

    index1 = find(modelInputs(1:M,1) == X1);
    index2 = find(modelInputs(1:M,2) == X2);
    index = intersect(index1,index2);

    % so the corresponding prediction will be
    modelOutput = predictions(index,1:2);
    actualOutput = outputTest(i,1:2);

    squaredError = (modelOutput - actualOutput)*(modelOutput - actualOutput)';
    mse = mse + squaredError;
end

% as it's the Mean, we divide by the number of samples.
mse = mse/L;

Using the Gaussian kernel we get a reasonably small mean square error.

Written with StackEdit.