Tuesday, April 19, 2016

Dimensionality Reduction

Dimensionality reduction is a method of reducing the number of random variables under consideration by obtaining a set of “uncorrelated” principal variables.

If we have $p$ features (each of them is one of the random variables we just talked about), then each data sample will be a $p$-dimensional vector in the feature space. When $p$ is a large number, patterns in the data can be hard to find, as a graphical representation is not possible.

What if we were able to identify a set of “principal variables”, which are transformations of the existing variables, such that we don’t lose much information?

PCA is a famous example in which such data transformations are linear; there are other, non-linear methods too.

Here I’ll be talking about PCA (Principal Component Analysis) and K-LDA (the kernelized version of Linear Discriminant Analysis (LDA)), and using these techniques to reduce the 2-dimensional data in this dataset to a single dimension.

Principal Component Analysis

PCA is the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized.

Variance is the spread of the data. The more spread out the data is, the better it is for us, as we can more easily separate it into clusters, etc. We lack information if all the data vectors are localized around the mean.

So to perform PCA, we calculate the directions (in the $p$-dimensional feature space) in which the variance is maximum (say there are $q$ such directions). Using these directions as basis vectors, we can then project all the data vectors onto this subspace.

Here $p = 2$ and $q = 1$, and we choose the directions by calculating the eigenvectors of the covariance matrix. The eigenvectors corresponding to the larger eigenvalues are the directions along which the variance is larger.
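As a compact summary of the procedure below (my notation, stated ahead of the detailed steps: $\mathbf{B}$ is the mean-centered data matrix constructed shortly, with the $n$ data vectors as rows, and $\mathbf{v}_{\max}$ is the eigenvector with the largest eigenvalue),

$$\mathbf{C} = \frac{1}{n-1}\,\mathbf{B}^{T}\mathbf{B}, \qquad \mathbf{C}\,\mathbf{v}_i = \lambda_i\,\mathbf{v}_i, \qquad \hat{\mathbf{x}} = \left(\mathbf{v}_{\max}^{T}\,\mathbf{x}\right)\mathbf{v}_{\max}$$

where $\hat{\mathbf{x}}$ is the projection of a (centered) data vector $\mathbf{x}$ onto the principal axis.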

Now I will be applying the above method in MATLAB as part of an assignment. As I am in the ICE department, I am assigned the 6th dataset, TRAIN{1,6}, which as you can see has 4 clusters.

  • So we can load the dataset and initialize the clusters, after which we can plot them as a scatter graph.
load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan


scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;

Initial dataset

  • So we have data comprising a set of $n$ observations of $p$ variables. This data can be arranged as a set of data vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$, each representing a single grouped observation of the $p$ variables.
    Here the 4 clusters, with 13 data vectors each, contribute to form $n = 52$. Write each $\mathbf{x}_i$ as a row vector with $p$ columns, and place these row vectors into a single matrix $\mathbf{X}$ of dimensions $n \times p$.
  • Now we center our data at the origin. This is supposed to help in reducing the mean square error when we approximate the data, and this answer also provides some insight into why it’s necessary for computing the covariance matrix. Note that after centering at the origin, if you execute sum(B) you will get values very close to 0, so we can conclude the data has been successfully centered.
    But the images before and after centering don’t look that different for this dataset, as the empirical mean is small and the entire dataset just shifts as a whole by a small amount.
% number of data vectors - 13*4
n = 52;
% number of dimensions (before reduction)
p = 2; 
% arrange data vectors as rows
X = [cluster1';cluster2'; cluster3'; cluster4'];
% find both x and y's mean - empirical mean vector
u = (1/n)*sum(X);
% subtract the mean from each data vector - thus achieving centering
B = X - ones(n,1)*u;


% Plot the data after centering it around the Origin
figure(2);
scatter(B(1:13,1), B(1:13,2), 'g'); 
hold on;
scatter(B(14:26,1), B(14:26,2), 'b');
hold on;
scatter(B(27:39,1), B(27:39,2), 'r');
hold on;
scatter(B(40:52,1), B(40:52,2), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

Dataset after centering

  • Next I will find the empirical covariance matrix from the outer product of matrix B with itself, $\mathbf{C} = \frac{1}{n-1}\mathbf{B}^{T}\mathbf{B}$.
C = (B'*B)/(n-1);
  • Then we can find the eigenvectors of the covariance matrix and choose the one which has the larger eigenvalue (in this dataset that turns out to be the second eigenvector returned by eig, as used below). Now we can project all the 2D points onto the line in the direction of that particular eigenvector.
% svds returns the singular values of C in decreasing order; since C is a
% symmetric positive semi-definite matrix, these equal its eigenvalues
E = svds(C);
% V will have the eigen vectors
[V,D] = eig(C);

% The axis along which the variance is maximum is the eigenvector with the
% largest eigenvalue; eig returns the eigenvalues of this symmetric matrix
% in ascending order here, so that is the last column of V
maxVarianceAxis = V(:,2);

% No need to divide by (maxVarianceAxis'*maxVarianceAxis) as
% eig generates orthonormal eigenvectors
projectionMatrix = (maxVarianceAxis*maxVarianceAxis');

projCluster1 = projectionMatrix*cluster1;
projCluster2 = projectionMatrix*cluster2;
projCluster3 = projectionMatrix*cluster3;
projCluster4 = projectionMatrix*cluster4;

% Plot the data after projecting onto the principal component
figure(3);
scatter(projCluster1(1,:), projCluster1(2,:), 'g'); 
hold on;
scatter(projCluster2(1,:), projCluster2(2,:), 'b');
hold on;
scatter(projCluster3(1,:), projCluster3(2,:), 'r');
hold on;
scatter(projCluster4(1,:), projCluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

After projecting onto the principal component axis

As we can see, the separation of the blue and green clusters is good, but the light blue (cyan) cluster is getting mixed up with the red cluster.

Before I wrap up, I’d like to show what the projection of these data vectors onto the other eigenvector looks like.

Projection onto the other eigenvector

See how the red and light blue clusters are well separated here, but the green and the dark blue are not as well separated as they were in the projection onto the eigenvector corresponding to the largest eigenvalue. The variance along the principal axis is only slightly larger than the variance here, because the eigenvalues are close (about 0.2 difference).

So here we had 2 features, which we reduced to a single feature (the position of the data vector on the line, a single value if expressed as the signed distance from the origin). Thus we successfully implemented PCA and extracted an abstract feature along which the data has maximum variance globally, thereby reducing the dimensionality.
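For completeness, the reduced one-dimensional coordinate itself (the signed position of each centered data vector along the principal axis) can be extracted with a short sketch like the following, reusing B and maxVarianceAxis from the code above (yPCA is just a name I introduce here):

% 1-D PCA feature: signed position of each centered data vector along the principal axis
yPCA = B * maxVarianceAxis;   % 52 x 1 column vector of projected coordinates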

This leads us to LDA, where we no longer look at this dataset as just a collection of data vectors; rather, we take into consideration that they are vectors from different classes.

Linear Discriminant Analysis

This is a method to find a linear combination of features that characterizes or separates two or more classes of objects.

This method is also known as Fisher’s linear discriminant. The idea is that we don’t wish to project such that the overall variance is maximized; rather, we want the clusters to be spread out (maximize the between-class covariance after projection) while the data vectors of a particular class stay close to the mean of that class (the within-class covariance, summed over all the classes, should be minimized).

The idea is that this enables us to easily discriminate new data vectors into classes. The more clearly separated the classes are, the less ambiguity we face when classifying.

I’ll explain using 2 classes and then extend the idea to multiple classes (in this dataset we are using 4 classes). Let there be 2 classes, $C_1$ and $C_2$, having $N_1$ and $N_2$ vectors respectively, and let $\mathbf{x}_n$ be the $n$-th data vector, with $n \in C_k$ meaning it belongs to class $C_k$. So the means of the classes will be

$$\mathbf{m}_k = \frac{1}{N_k}\sum_{n \in C_k} \mathbf{x}_n, \qquad k = 1, 2$$

The projection from $p$ dimensions to 1 dimension is done using

$$y = \mathbf{w}^{T}\mathbf{x}$$

where the sizes are $\mathbf{x}: p \times 1$, $\mathbf{w}: p \times 1$, and $y$ a scalar.

The simplest measure of the separation of the classes, when projected along the line defined by $\mathbf{w}$, is the square of the separation between the projected class means. If the means after projection of classes 1 and 2 are $m_1 = \mathbf{w}^{T}\mathbf{m}_1$ and $m_2 = \mathbf{w}^{T}\mathbf{m}_2$, then

$$(m_2 - m_1)^2 = \mathbf{w}^{T}\left(\mathbf{m}_2 - \mathbf{m}_1\right)\left(\mathbf{m}_2 - \mathbf{m}_1\right)^{T}\mathbf{w} = \mathbf{w}^{T}\mathbf{S}_B\,\mathbf{w}$$

Here we define the between-class covariance matrix as

$$\mathbf{S}_B = \left(\mathbf{m}_2 - \mathbf{m}_1\right)\left(\mathbf{m}_2 - \mathbf{m}_1\right)^{T}$$

Thus $\mathbf{w}^{T}\mathbf{S}_B\,\mathbf{w}$ is to be maximized.

Meanwhile, we minimize the sum of the within-class variances for each of the classes, $s_1^{2} + s_2^{2}$.

The within-class variance of the transformed data for class $C_k$ is given by

$$s_k^{2} = \sum_{n \in C_k} \left(y_n - m_k\right)^2 = \sum_{n \in C_k} \left(\mathbf{w}^{T}\mathbf{x}_n - \mathbf{w}^{T}\mathbf{m}_k\right)^2$$

So, after taking $\mathbf{w}$ out of the summation and rewriting the sum in terms of the total within-class covariance matrix

$$\mathbf{S}_W = \sum_{k=1,2}\;\sum_{n \in C_k} \left(\mathbf{x}_n - \mathbf{m}_k\right)\left(\mathbf{x}_n - \mathbf{m}_k\right)^{T},$$

for 2 classes the measure of how close the data vectors are to their respective class means is

$$s_1^{2} + s_2^{2} = \mathbf{w}^{T}\mathbf{S}_W\,\mathbf{w}$$

Thus $\mathbf{w}^{T}\mathbf{S}_W\,\mathbf{w}$ is to be minimized.

So with this we can construct the utility function which we need to maximize,

$$J(\mathbf{w}) = \frac{\mathbf{w}^{T}\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^{T}\mathbf{S}_W\,\mathbf{w}}$$

Differentiating $J(\mathbf{w})$ with respect to $\mathbf{w}$, setting the derivative equal to zero, and rearranging gives

$$\mathbf{w} \propto \mathbf{S}_W^{-1}\left(\mathbf{m}_2 - \mathbf{m}_1\right)$$

Thus we have found the direction $\mathbf{w}$ for which our conditions are optimally achieved.

Multi-Class LDA

But extending this idea to multiple classes requires us to make some assumptions, so as to make the mathematics a little easier.

Suppose that now there are $K$ classes. The within-class covariance matrix is calculated as

$$\mathbf{S}_W = \sum_{k=1}^{K}\;\sum_{n \in C_k} \left(\mathbf{x}_n - \mathbf{m}_k\right)\left(\mathbf{x}_n - \mathbf{m}_k\right)^{T}$$

while the between-class covariance matrix is given by

$$\mathbf{S}_B = \sum_{k=1}^{K} N_k \left(\mathbf{m}_k - \mathbf{m}\right)\left(\mathbf{m}_k - \mathbf{m}\right)^{T}$$

In the above 2 equations, $\mathbf{m}$ is the mean of all the data vectors and $\mathbf{m}_k$ is the mean of the data vectors in the $k$-th class.

Using the revised covariance matrices and plugging them into the same utility function $J(\mathbf{w})$, we find the optimal solution for $\mathbf{w}$:

$$\mathbf{S}_W^{-1}\mathbf{S}_B\,\mathbf{w} = \lambda\,\mathbf{w}$$

That is, the optimal $\mathbf{w}$ is an eigenvector of $\mathbf{S}_W^{-1}\mathbf{S}_B$ (the leading eigenvector, when we project onto a single direction as we do here).
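Before kernelizing, here is a minimal MATLAB sketch (not part of the original assignment code; the variable names Xall, Sw, Sb, yLDA are my own) of how this plain multi-class LDA solution could be computed for the 4 clusters loaded in the PCA section, assuming cluster1 ... cluster4 are the 2-by-13 matrices from above.

% Plain multi-class LDA sketch (assumes cluster1..cluster4 are 2x13 matrices)
clusters = {cluster1, cluster2, cluster3, cluster4};
Xall = [cluster1, cluster2, cluster3, cluster4]; % 2 x 52, columns are data vectors
m = mean(Xall, 2);                               % mean of all data vectors

Sw = zeros(2,2);                                 % within-class covariance matrix
Sb = zeros(2,2);                                 % between-class covariance matrix
for k = 1:4
    Ck = clusters{k};
    Nk = size(Ck, 2);
    mk = mean(Ck, 2);                            % class mean
    Dk = Ck - mk*ones(1, Nk);                    % class data centered on its mean
    Sw = Sw + Dk*Dk';
    Sb = Sb + Nk*(mk - m)*(mk - m)';
end

% Optimal direction: eigenvector of inv(Sw)*Sb with the largest eigenvalue
[V, D] = eig(Sw\Sb);
[~, idx] = max(diag(D));
w = V(:, idx);

% Projected 1-D coordinates of all data vectors
yLDA = w' * Xall;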

Kernel Trick + LDA = K-LDA

K-LDA, the kernelized version of linear discriminant analysis (LDA) (also known as kernel Fisher discriminant analysis or kernel discriminant analysis), is LDA implicitly performed in a new feature space, which allows non-linear mappings to be learned.

For most real-world data a linear discriminant (offered by LDA) is not complex enough, because the data vectors might not be linearly separable at all.

We first map the data non-linearly into some feature space (a higher-dimensional space) and compute LDA there (for different kernel functions these mappings differ). The linear separation which LDA provides in the feature space will yield a non-linear discriminant in the original input space. You can think of 2 clusters (in the 2D plane) with one of the clusters embedded inside the other; if we map these vectors to 3 dimensions, we can introduce a dummy variable (the third axis) such that the 2 clusters are easily separated linearly (by a plane). As you can see below, the left-side image is the input space and the right-side image is the feature space.

Why use a feature space: input space (left) and feature space (right)
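As a toy MATLAB sketch of that picture (a hypothetical example, not part of the assignment data): for an inner cluster surrounded by an outer ring, adding the dummy coordinate $z = x^2 + y^2$ makes the two clusters separable by a horizontal plane.

% Toy example: two clusters that are not linearly separable in 2D
theta = linspace(0, 2*pi, 100)';
inner = 0.5*[cos(theta), sin(theta)] + 0.05*randn(100, 2);
outer = 2.0*[cos(theta), sin(theta)] + 0.05*randn(100, 2);

% Map to 3D with the dummy third coordinate z = x^2 + y^2
innerZ = sum(inner.^2, 2);
outerZ = sum(outer.^2, 2);

% In this feature space a horizontal plane (e.g. z = 1) separates the clusters
scatter3(inner(:,1), inner(:,2), innerZ, 36, 'g');
hold on;
scatter3(outer(:,1), outer(:,2), outerZ, 36, 'r');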

So, as we are operating in the feature space, the only change to the LDA explained above is that we use $\phi(\mathbf{x})$ instead of $\mathbf{x}$ everywhere. Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable; for example, $\phi(\mathbf{x})$ may be infinite-dimensional (as in the case of the Gaussian kernel).

So instead of designing a nonlinear transformation $\phi$ and explicitly calculating the transformed data vectors $\phi(\mathbf{x}_i)$ in the feature space, we can rewrite the method in terms of dot products only and use the kernel trick, in which the dot product in the new feature space is replaced by a kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j)$.

I used this paper and also referred to Wikipedia; we can rewrite the equation for $\mathbf{w}$ in terms of only dot products.

In the following I am assuming these sizes apply: $\boldsymbol{\alpha}$ and each $\mathbf{M}_i$ are $n \times 1$ vectors, while $\mathbf{K}$, $\mathbf{M}$ and $\mathbf{N}$ are $n \times n$ matrices. The remaining sizes can be understood implicitly. Here there are $p$ features, and the projection is onto 1 dimension. The $n$ data vectors are separated into $c$ classes, with the $i$-th class having $n_i$ data vectors.

First of all, $\mathbf{w}$ should lie in the span of all the training samples mapped into the feature space (read up on the theory of reproducing kernels to understand why):

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i\, \phi(\mathbf{x}_i)$$

where the $\alpha_i$ are scalar coefficients and $n$ is the total number of vectors.

Now take the dot product of $\mathbf{w}$ with the mean of the $i$-th class in the feature space, $\mathbf{m}_i^{\phi}$. Here $\boldsymbol{\alpha}$ denotes the vector formed by piling all the $\alpha_j$ together, and the mean of each class in the feature space is defined just like we discussed in the LDA section, only with $\phi(\mathbf{x})$ in place of $\mathbf{x}$.

Using the equations for $\mathbf{w}$ and $\mathbf{m}_i^{\phi}$ we can say

$$\mathbf{w}^{T}\mathbf{m}_i^{\phi} = \frac{1}{n_i}\sum_{j=1}^{n}\sum_{k=1}^{n_i} \alpha_j\, k\!\left(\mathbf{x}_j, \mathbf{x}_k^{(i)}\right) = \boldsymbol{\alpha}^{T}\mathbf{M}_i$$

where $\mathbf{M}_i$ is a vector whose $j$-th element is defined as

$$\left(\mathbf{M}_i\right)_j = \frac{1}{n_i}\sum_{k=1}^{n_i} k\!\left(\mathbf{x}_j, \mathbf{x}_k^{(i)}\right)$$

Here $\mathbf{x}_j$ is the $j$-th data vector among ALL the vectors, while $\mathbf{x}_k^{(i)}$ is the $k$-th vector in the $i$-th class.

If we define a matrix $\mathbf{M}$ as

$$\mathbf{M} = \sum_{i=1}^{c} n_i \left(\mathbf{M}_i - \mathbf{M}_*\right)\left(\mathbf{M}_i - \mathbf{M}_*\right)^{T}$$

where the $j$-th element of the vector $\mathbf{M}_*$ is defined as

$$\left(\mathbf{M}_*\right)_j = \frac{1}{n}\sum_{k=1}^{n} k\!\left(\mathbf{x}_j, \mathbf{x}_k\right)$$

then the numerator of the utility function can be written as

$$\mathbf{w}^{T}\mathbf{S}_B^{\phi}\,\mathbf{w} = \boldsymbol{\alpha}^{T}\mathbf{M}\,\boldsymbol{\alpha}$$

Here $\mathbf{S}_B^{\phi}$ is just the between-class covariance matrix defined for the transformed data vectors (in the feature space).

Moving on to the denominator: if we define a matrix $\mathbf{N}$ as

$$\mathbf{N} = \sum_{i=1}^{c} \mathbf{K}_i \left(\mathbf{I} - \mathbf{1}_{n_i}\right)\mathbf{K}_i^{T}$$

with the $(j,m)$ component of $\mathbf{K}_i$ defined as $\left(\mathbf{K}_i\right)_{jm} = k\!\left(\mathbf{x}_j, \mathbf{x}_m^{(i)}\right)$. Also, here $\mathbf{I}$ is the identity matrix and $\mathbf{1}_{n_i}$ is the $n_i \times n_i$ matrix with all entries equal to $\frac{1}{n_i}$.

Here $\mathbf{K}_i$ (of size $n \times n_i$) is known as the kernel matrix for class $i$.

Then the denominator of the utility function can be written as

$$\mathbf{w}^{T}\mathbf{S}_W^{\phi}\,\mathbf{w} = \boldsymbol{\alpha}^{T}\mathbf{N}\,\boldsymbol{\alpha}$$

Thus, using both of the highlighted equations, the utility function in the feature space is

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{T}\mathbf{M}\,\boldsymbol{\alpha}}{\boldsymbol{\alpha}^{T}\mathbf{N}\,\boldsymbol{\alpha}}$$

This problem can be solved (just like LDA was solved in the input space), and the optimal value of $\boldsymbol{\alpha}$ is the leading eigenvector of $\mathbf{N}^{-1}\mathbf{M}$.

MATLAB Implementation

The above equations can be implemented in MATLAB as follows. Note that there are 2 features and we wish to reduce them to a single dimension. There are 4 classes ($c = 4$), each class having 13 vectors ($n_i = 13$), so the total number of vectors is $n = 52$.

load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan

nT = 52; % Total number of data Vectors
n1 = 13; % # vectors in cluster 1
n2 = 13; % # vectors in cluster 2
n3 = 13; % # vectors in cluster 3
n4 = 13; % # vectors in cluster 4

% Plot the data before dimensionality reduction
figure(1);
scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');
  • Now I’ll construct the Gram matrix for all the data vectors using the Gaussian kernel (the helper kernelGauss; a possible sketch of it is given right after this code block).
% CONSTRUCT KERNEL MATRIX
gamma = 10;

% Arrange all the clusters together
X = [cluster1';cluster2'; cluster3'; cluster4'];
K = ones(nT,nT);

% Construct the Gram Matrix 
% K(i,:) - gives the ith vector out of ALL vectors
for i = 1:1:nT
    for j = 1:1:nT
        K(i,j) = kernelGauss(X(i,:),X(j,:),gamma);
    end
end
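The helper kernelGauss used above is not listed in the post; a minimal sketch of a Gaussian (RBF) kernel consistent with how it is called could look like the following, saved as kernelGauss.m on the MATLAB path. The exact parameterization (how gamma enters) is my assumption and may differ from the author's version.

function k = kernelGauss(x1, x2, gamma)
% Gaussian (RBF) kernel between two row vectors x1 and x2.
% Assumed form: k(x1, x2) = exp(-||x1 - x2||^2 / (2*gamma^2)).
% The original post may use a different parameterization of gamma.
d = x1 - x2;
k = exp(-(d*d') / (2*gamma^2));
end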
  • Following the equations given above we can construct M,
% K(i,j) = k(x_i, x_j). Since X(1:13,:), X(14:26,:), X(27:39,:) and
% X(40:52,:) are the 4 clusters respectively, we can calculate each M_i;
% the index ranges below follow from how X was constructed.

M_1 = ones(nT,1);
M_2 = ones(nT,1);
M_3 = ones(nT,1);
M_4 = ones(nT,1);
M_star = ones(nT,1);

% sum(K(1:13,j)) is the sum of k(x,x_j) over all x belonging to cluster 1
% sum(K(14:26,j)) is the sum of k(x,x_j) over all x belonging to cluster 2
% sum(K(27:39,j)) is the sum of k(x,x_j) over all x belonging to cluster 3
% sum(K(40:52,j)) is the sum of k(x,x_j) over all x belonging to cluster 4
% sum(K(:,j)) is the sum of k(x,x_j) over ALL x (all clusters)

for j = 1:1:nT
    M_1(j) = (1/nT)*sum(K(1:13,j));
end
for j = 1:1:nT
    M_2(j) = (1/nT)*sum(K(14:26,j));
end
for j = 1:1:nT
    M_3(j) = (1/nT)*sum(K(27:39,j));
end
for j = 1:1:nT
    M_4(j) = (1/nT)*sum(K(40:52,j));
end

for j = 1:1:nT
    M_star(j) = (1/nT)*sum(K(:,j));
end

% Thus we can construct M
M = n1*(M_1-M_star)*(M_1-M_star)' + n2*(M_2-M_star)*(M_2-M_star)' + n3*(M_3-M_star)*(M_3-M_star)' + n4*(M_4-M_star)*(M_4-M_star)';
  • For constructing N, we will first separate the Gram matrix we made into the kernel matrices for each of the classes.
% Now we shall construct the kernel matrices for each of the clusters
K_1 = K(:,1:13);
K_2 = K(:,14:26);
K_3 = K(:,27:39);
K_4 = K(:,40:52);

% Thus we can construct N
N = K_1*(eye(n1,n1)-(1/n1)*ones(n1,n1))*K_1' + K_2*(eye(n2,n2)-(1/n2)*ones(n2,n2))*K_2' + K_3*(eye(n3,n3)-(1/n3)*ones(n3,n3))*K_3' + K_4*(eye(n4,n4)-(1/n4)*ones(n4,n4))*K_4';
  • Now compute $\mathbf{N}^{-1}\mathbf{M}$ and its eigenvalues and eigenvectors. Note that as N is usually singular, we add a multiple of the identity matrix to it so that inv(N) becomes computable.
% Note that in practice, \mathbf{N} is usually singular and 
% so a multiple of the identity is added to it
N = N + 2*eye(nT,nT);
cond(N) % is N ill conditioned? 

% Now that we have both N and M, we can find all the eigenvectors
% and choose most prominent eigenvectors
P = inv(N)*M;
[V,D] = eig(P);
  • Now we can project the data onto a line using the leading eigenvector, which has an eigenvalue of 3.07.
% the first eigenvector has the largest corresponding eigenvalue - 3.07
% after which 0.4, 0.28, 0.26 Then it drops down...
alpha = V(:,1);

% the projected points
y = ones(nT,1);

for i = 1:1:nT
    y(i) = alpha'*K(:,i);
end

projCluster1 = y(1:13);
projCluster2 = y(14:26);
projCluster3 = y(27:39);
projCluster4 = y(40:52);

% Plot the projected data
figure(3);
scatter(projCluster1, zeros(13,1), 'g'); 
hold on;
scatter(projCluster2, zeros(13,1), 'b');
hold on;
scatter(projCluster3, zeros(13,1), 'r');
hold on;
scatter(projCluster4, zeros(13,1), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

The plot we get for gamma (of the kernel function) being 10 is,

After projection

If we try different values of gamma for the kernel function, we can likely obtain even better separation of the classes.

So, in conclusion, both methods can be viewed as techniques for linear dimensionality reduction (K-LDA being linear in the feature space rather than the input space). However, PCA is unsupervised and depends only on the data vectors (it maximizes their variance using fewer features), whereas the Fisher linear discriminant (and its kernelized form, K-LDA) also uses class-label information to bring the data vectors closer to other data vectors of the same class while maximizing the variance between classes.

Written with StackEdit.
