Wednesday, November 30, 2016

Interesting Questions

Question 1

Divide a circle into 6 sectors and write the numbers 1, 0, 1, 0, 0, 0 in them, one number per sector. A move consists of incrementing two neighbouring numbers by 1. Can we make all the numbers in the circle equal by a sequence of these moves?

I tried brute force and as intuition seems to suggest, this isn’t possible.

import operator

def spl_add(a, b):
    """Element-wise (sector-wise) addition of two tuples."""
    return tuple(map(operator.add, a, b))

# The six possible moves: add 1 to each pair of adjacent sectors (the circle wraps around).
MOVES = [
    (1, 1, 0, 0, 0, 0),
    (0, 1, 1, 0, 0, 0),
    (0, 0, 1, 1, 0, 0),
    (0, 0, 0, 1, 1, 0),
    (0, 0, 0, 0, 1, 1),
    (1, 0, 0, 0, 0, 1),
]

def move(n):
    # Bound the brute-force search: each move adds 2 to the total.
    if sum(n) > 10:
        return
    if n.count(n[0]) == len(n):
        print("Hooray", n)
        return
    for m in MOVES:
        move(spl_add(n, m))

start = (1, 0, 1, 0, 0, 0)
move(start)

SOLUTION? Why no? One way to see it: each move adds 1 to one odd-numbered sector and 1 to one even-numbered sector, so the difference between the sum of the odd sectors and the sum of the even sectors (which starts at $2 - 0 = 2$) never changes; making all six numbers equal would require that difference to be 0.

Question 2

Of all the sets of positive integers whose sum is 1000, which set has the maximum product?

Solution 2

Consider the numbers 2 and 3: does 2 or 3 contribute more to the size of the product? This is what I was first faced with. I could take 500 2's and their sum would be 1000. Here the product would be $2^{500}$. If you thought 2 contributed more you would go with this.

Say you thought 3 was more important; in this case 333 3's would sum to 999. Note the product here is $3^{333}$.

Even though we humans are naturally bad at grasping enormous numbers, we can compare these two: $2^{500} = 4\cdot(2^{3})^{166} = 4\cdot 8^{166}$, which is clearly less than $3\cdot(3^{2})^{166} = 3\cdot 9^{166} = 3^{333}$, since $8 < 9$ and the factor $(9/8)^{166}$ dwarfs the leftover $4/3$.

Ok, as some of us might have guessed, it's better to multiply many copies of a smaller number than fewer copies of a larger number.

Multiplication is repeated Addition, while Power is repeated Multiplication.

But $3^{333}$ is still not the answer. Let us divide the sum $S$ (a constant, here 1000) evenly into pieces of size $x$, so that there are $S/x$ of them. So we wish to maximize the product

$$P(x) = x^{S/x}.$$

Note how $\ln P(x) = \frac{S}{x}\ln x$, and maximizing $\ln P$ maximizes $P$.

To maximize it, we differentiate and equate that to 0:

$$\frac{d}{dx}\left(\frac{S}{x}\ln x\right) = \frac{S\,(1 - \ln x)}{x^{2}} = 0 \;\implies\; \ln x = 1 \;\implies\; x = e.$$

So you can see how, no matter what the sum is (it doesn't have to be 1000), the product is maximized when each piece equals $e \approx 2.718$. But we can't choose $e$ as it's not an integer, so we choose the nearest integer to $e$, i.e. 3.

So the ideal answer should be full of 3's. But if we take 333 3's then we would have to take 1 as the last integer, and multiplying by 1 does nothing for the product. So 332 3's and two 2's is optimal, as $3\times 332 + 2\times 2 = 1000$ and $2^{2}\cdot 3^{332} > 1\cdot 3^{333}$.

In the case that the sum leaves a remainder of 2 when divided by 3, we should instead take $\lfloor S/3 \rfloor$ 3's and a single 2.
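As a quick sanity check of the 3's-and-2's rule, here is a small brute-force sketch (my own, not from the original post) that computes the maximum product over all ways of splitting a small sum:

from functools import lru_cache

@lru_cache(maxsize=None)
def max_product(s):
    """Maximum product over all ways of writing s as a sum of positive integers
    (including leaving s unsplit)."""
    best = s  # the trivial "split" consisting of s itself
    for first in range(1, s):
        best = max(best, first * max_product(s - first))
    return best

# Matches the rule: use 3's, patching a remainder of 1 with two 2's.
print(max_product(9), max_product(10))   # 27 = 3*3*3, 36 = 3*3*2*2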

Question 3

A dragon has $N$ heads. A soldier can cut off 24 heads, but then 15 will grow back; he can cut off 17 heads, but then 2 will grow back. He can also cut off 20 heads, but then 11 will grow back. Will the dragon ever die (reach exactly 0 heads) if the soldier tries various combinations of these moves? If so, how?

Solution 3

Move 1 causes a net change of $-24 + 15 = -9$ heads.
Move 2 causes a net change of $-17 + 2 = -15$ heads.
Move 3 causes a net change of $-20 + 11 = -9$ heads.

Say we do $a$ of move 1, $b$ of move 2, and $c$ of move 3. The net change is

$$-9a - 15b - 9c.$$

For the dragon to die this must equal $-N$, so we want to find integer values of $a, b, c$ such that $9a + 15b + 9c = N$. Solving for it, divide the whole equation by 3:

$$3a + 5b + 3c = \frac{N}{3}.$$

The left-hand side, $3a + 5b + 3c$, will always be an integer, say $k$.

But the number of heads given in the question is not a multiple of 3, so $N/3$ is an integer plus a proper fraction, and a fraction plus an integer can never equal the integer $k$. So we can be sure there exist no integers $a, b, c$ such that the above equation is satisfied.

The dragon can never be killed
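To make the divisibility argument concrete, here is a tiny Python check of mine (not from the post). The head counts 100 and 102 are purely illustrative, since the question's actual number isn't reproduced above; the check only tests the necessary mod-3 condition.

from math import gcd
from functools import reduce

# Net head reductions per move: cut 24 grow 15, cut 17 grow 2, cut 20 grow 11.
nets = [24 - 15, 17 - 2, 20 - 11]   # [9, 15, 9]
g = reduce(gcd, nets)               # gcd(9, 15, 9) = 3

def can_possibly_die(heads):
    """Necessary condition for 9a + 15b + 9c = heads to have integer solutions:
    heads must be a multiple of gcd(9, 15, 9) = 3."""
    return heads % g == 0

print(can_possibly_die(100))   # False - 100 is not a multiple of 3 (illustrative value)
print(can_possibly_die(102))   # True  - the divisibility obstruction alone doesn't rule it out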

Question 4

What is the remainder when you divide a number $N$ by 999, where $N$ is formed by writing a certain block of digits 16 times in a row?

Solution 4

The first key insight is to remember how the divisibility rule for 9 (which happens to also be used for 3) was derived. If a decimal number $N$ has the digits $d_k d_{k-1}\ldots d_1 d_0$, then

$$N = \sum_{i=0}^{k} d_i\,10^{i} = \sum_{i=0}^{k} d_i\,(10^{i} - 1) + \sum_{i=0}^{k} d_i.$$

We know that $10^{i} - 1$ (a string of nines) is always divisible by 9. So if the digit sum $\sum_i d_i$ is divisible by 9, the whole number is divisible by 9.

Coming to this problem, we need to divide by 999, which divides powers of 10 in a peculiar pattern:

$$10^{1} \equiv 10,\quad 10^{2} \equiv 100,\quad 10^{3} \equiv 1,\quad 10^{4} \equiv 10,\quad 10^{5} \equiv 100,\quad 10^{6} \equiv 1 \pmod{999},\ \ldots$$

So 999 leaves a remainder of 1 only periodically with the powers of 10 (every third power), unlike 9, which was nice enough to leave 1 as the remainder for every power of 10 - that is what had helped us isolate just the digits of $N$.

So our $N$ in terms of its digits would be

$$N = \sum_{i} d_i\,10^{i}.$$

We can't pull a multiple of 999 out of every term individually (only the powers $10^{0}, 10^{3}, 10^{6}, \ldots$ reduce to 1), so let's group the digits based on which power of 10 accompanies them: modulo 999, each power of 10 is congruent to 1, 10 or 100, repeating with period 3. Equivalently, if we cut the decimal representation of $N$ into blocks of three digits starting from the right, each block contributes just its own value, so

$$N \equiv \big(\text{sum of its three-digit blocks}\big) \pmod{999}.$$

Here we needed to find out how many such blocks there are, i.e. how many of the powers $10^{0}, 10^{3}, 10^{6}, \ldots$ occur in $N$. One way is to calculate $n$ from the AP equation $a_n = a + (n-1)d$ with common difference $d = 3$.

Adding the blocks up and reducing the (now small) sum modulo 999 can be done by hand, and we get the answer 786.

The remainder is 786
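Here is a short Python sketch of mine (not from the post) of the block-summing trick; the number "2016" written 16 times is only an illustrative stand-in, since the post's actual number isn't reproduced above.

def mod_999_by_blocks(digits):
    """Reduce a huge decimal number modulo 999 by summing its 3-digit blocks
    (taken from the right), using the fact that 10**3 == 1 (mod 999)."""
    total = 0
    while digits:
        total += int(digits[-3:])   # the last (up to) three digits
        digits = digits[:-3]
    return total % 999

n = "2016" * 16                     # an illustrative 64-digit number
assert mod_999_by_blocks(n) == int(n) % 999
print(mod_999_by_blocks(n))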

Question 5

There is a set of numbers $\{x_1, x_2, \ldots, x_n\}$. We proceed to repeatedly select (at random) two numbers ($a$ and $b$) and compute the number $a + b + ab$; we remove both $a$ and $b$ while we add $a + b + ab$ to the set. What will the final remaining number be?

Solution 5

Note that

$$a + b + ab = (1 + a)(1 + b) - 1.$$

So now the operation becomes "selecting (at random) two numbers ($a$ and $b$), computing the number $(1+a)(1+b) - 1$, removing both $a$ and $b$ while adding this new number to the set".

So for $\{x_1, x_2, x_3\}$, after the first operation (say on $x_1$ and $x_2$) it would become $\{(1+x_1)(1+x_2) - 1,\; x_3\}$, and applying the operation once more gives $\{(1+x_1)(1+x_2)(1+x_3) - 1\}$, since adding 1 to the combined element gives back the product $(1+x_1)(1+x_2)$.

So you see where this is going: each element simply contributes a factor of one plus itself, no matter in which order the pairs are picked.

The final term will be

$$(1 + x_1)(1 + x_2)\cdots(1 + x_n) - 1.$$

So for the set given in the question, the final remaining number is the product of one plus each of its elements, minus one.
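A quick simulation (my own sketch, on an arbitrary example set rather than the one in the question) confirms that the final number doesn't depend on the order in which pairs are picked:

import random
from math import prod

def collapse(values):
    """Repeatedly replace two randomly chosen elements a, b with a + b + a*b."""
    values = list(values)
    while len(values) > 1:
        a = values.pop(random.randrange(len(values)))
        b = values.pop(random.randrange(len(values)))
        values.append(a + b + a * b)
    return values[0]

nums = [1, 2, 3, 4, 5]                       # an arbitrary example set
expected = prod(1 + x for x in nums) - 1     # (2*3*4*5*6) - 1 = 719
assert all(collapse(nums) == expected for _ in range(100))
print(expected)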

Question 8

Find $n$, where $n + d(n) + d(d(n)) = 1000$. Here $d(n)$ is the digital sum of $n$.

The digital sum is just the digit sum of $n$. I was baffled by this term, for it was the first time I had heard it.

Solution 8

We know that $n + d(n) + d(d(n)) = 1000$. So, as all terms are positive, $n$ can be at most a 3-digit number; any larger than that and the sum will overshoot 1000.

It can't be a 2-digit number either, as even if $n$ is the largest 2-digit number, $99 + d(99) + d(d(99)) = 99 + 18 + 9 = 126$, far short of 1000.

So, expressing $n$ using its digits, $n = 100a + 10b + c$ (with $a \ge 1$), and $d(n) = a + b + c$.

We can see that the sum of 3 digits cannot exceed 27, a 2-digit number, so let

$$d(n) = a + b + c = 10p + q, \qquad\text{so that } d(d(n)) = p + q.$$

So the given equation will be

$$\underbrace{100a + 10b + c}_{n} + \underbrace{(a + b + c)}_{d(n)} + \underbrace{(p + q)}_{d(d(n))} = 1000,$$

i.e.

$$101a + 11b + 2c + p + q = 1000.$$

Now we can be sure $a \le 8$ is not possible, as then $n \le 899$, $d(n) \le 26$ and $d(d(n)) \le 10$, so the sum cannot exceed $935 < 1000$. So substituting $a = 9$,

$$909 + 11b + 2c + p + q = 1000 \;\implies\; 11b + 2c + p + q = 91, \qquad 9 + b + c = 10p + q.$$

Substituting $q = 9 + b + c - 10p$ into the first equation gives

$$12b + 3c - 9p = 82.$$

The left-hand side is always divisible by 3, while 82 is not. So we can be sure there exist no digits $a, b, c$ (and hence no integer $n$) such that the above equation is satisfied.

I wrote some code in Python to try and brute-force the solution:

# To find n such that n + d(n) + d(d(n)) = 1000
def d(i):
    """Digit sum of i."""
    return sum(int(ch) for ch in str(i))

# n must be a 3-digit number starting with 9 (see the argument above)
solutions = [n for n in range(900, 1000) if n + d(n) + d(d(n)) == 1000]
print(solutions)   # prints [] - no such n exists

Final Answer:

No such $n$ exists.


1) 1, 0, 1, 0, 0, 0 in a circle (6 sectors). Add one to two neighbouring numbers; can we get all numbers to be equal?
NO, but why…?

2) Sum of a set of integers is 1000, what’s the maximum possible product of these integers?

3) Will the dragon die? Moves: (−24, +15), (−17, +2), (−20, +11), i.e. net changes of −9, −15, −9.
No

4) Smallest multiple of 9 with all even digits?

5) couples in a village, keep copulating until a boy is born, how does the boy:girl ratio change?
1:1

6) (a block of digits repeated 16 times) % 999 = k (what's k?)
786

7) where
21

8) n + d(n) + d(d(n)) = 1000
No n exists that satisfies this equation.

9) What’s the output of this program?

def f(n):
    if(n>0):
        print("hello")
        print("world")

“hello” followed by “world”

10) Find the 100 largest numbers in a set of numbers?
heap ds

11) Find a set of 20 positive integers such that for all (maybe any?) subset of that set the sum of the elements in that subset is a perfect number?
??

12) There is a set of numbers; remove 2 numbers a and b at random from this set, then add the number a + b + ab to the set. Repeating this, what number will finally be remaining?

13) How many 36 digit numbers exist which are not divisible by 100?

Written with StackEdit.

Sunday, October 23, 2016

Generalized coordinates

In this post I’ll try to explain what generalized coordinates are, how they came to be, and finally where they are used, with a simple example.

To start with, imagine a system of $N$ point masses present in a $d$-dimensional space. The $i^{th}$ point has a position vector $\mathbf{r}_i$.

Note that if the system were unconstrained, then as any of the bodies can occupy any point in the space, we would need $dN$ scalars to completely define any particular configuration of the system.

For visualization purposes consider $d = 3$, so we use 3 scalars (the $x, y, z$ coordinates) to define the position of each point mass. As we have a total of $N$ such masses we need $3N$ scalars.

But often the systems we need to analyse are under some sort of constraint. In the case of a robotic arm, adjacent hinge points always maintain a fixed distance from each other due to the rigid rod connecting them.

In such situations we don’t need all $3N$ scalars; rather, we can use fewer scalars and still manage to uniquely represent all possible states that this constrained system can take. This is because the constraints decrease the degrees of freedom available to the system.

Holonomic constraints

We classify constraints as either being holonomic or nonholonomic.

If a constraint can be expressed purely as a relationship between the position variables ($\mathbf{r}_i$) and time $t$, i.e. in the form

$$f(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N, t) = 0,$$

it is a holonomic constraint.

Then using each such constraint equation we can reduce the number of scalars needed by one. If we knew the time and all but one position variable, using the constraint equation that missing coordinate can be deduced.

When the constraint has inequalities or higher-order derivative terms in it, it is classified as nonholonomic. Note that we have an equality symbol in the above expression. Also, the function has to be of the position variables themselves, not their derivatives.

All constraints which are unable to be expressed in the above form are called nonholonomic constraints.

Note that if there were velocity terms ($\dot{\mathbf{r}}_i$) in the constraint equation, then integrating that constraint could in some cases yield an equation of the form shown above; if so, then it’s a holonomic constraint. So what Wikipedia means when they say that constraints with velocity terms “are not usually holonomic” is that it’s kinda hard to bring them to this form in that case.

When constraints depend purely on the positions at any given time, it's a holonomic constraint; but in a nonholonomic constraint, the constraint changes depending on how you reach that position! This seems to be analogous to path and state functions.

To visualise how a holonomic constraint reduces the DOF, imagine a point mass constrained to move along the line $y = mx + c$. Without this constraint we needed both position variables ($x$ and $y$) to describe its position anywhere in the XY plane, but now we just need $x$: if we have $x$, its position is known to be $(x, mx + c)$ due to the constraint.

Constrained to move in a circle of radius R

So a point mass is moving around in a plane (say the XY plane). We can describe its position at any point of time using the vector

$$\mathbf{r} = x\,\hat{i} + y\,\hat{j}.$$

If we are given that this point mass is constrained to move in a circle of radius $R$, the constraint equation would look like this:

$$x^{2} + y^{2} = R^{2}.$$

Note how the equation can be written as $x^{2} + y^{2} - R^{2} = 0$, i.e. in the form $f(x, y) = 0$. Therefore we conclude that this is a holonomic constraint.

So the “generalized coordinate” I will use is $\theta$, the angle between the $x$ axis and the position vector $\mathbf{r}$.

circle

We know that $x = R\cos\theta$ and $y = R\sin\theta$.

Let the point mass move a little; it can of course only move in a small arc along the circle due to the constraint imposed on it. Such a displacement is called a virtual displacement. In terms of $(x, y)$ it's $(\delta x, \delta y)$, while in terms of the generalized coordinate it's $\delta\theta$.

The following formula relates the change in the generalized coordinates ($\delta q_j$) to the change in the natural coordinates ($\delta\mathbf{r}$):

$$\delta\mathbf{r} = \sum_{j=1}^{n} \frac{\partial\mathbf{r}}{\partial q_j}\,\delta q_j.$$

The term $\frac{\partial\mathbf{r}}{\partial q_j}$ is a measure of how much change in $\mathbf{r}$ happens for a unit change in $q_j$, keeping all other generalized coordinates constant - that’s what partial differentiation is.

So by multiplying with $\delta q_j$, you get the contribution of the $j^{th}$ generalized coordinate to the displacement $\delta\mathbf{r}$.

Summing over all these contributions (from $j = 1$ to $n$) we get the net virtual displacement $\delta\mathbf{r}$.

Coming back to our example, we can therefore note that

$$\delta x = \frac{\partial x}{\partial\theta}\,\delta\theta = -R\sin\theta\,\delta\theta, \qquad \delta y = \frac{\partial y}{\partial\theta}\,\delta\theta = R\cos\theta\,\delta\theta.$$

Credits to Maschen for this image, CC0.
Virtual displacement of the point mass on the circle

This makes sense: for a small displacement, the arc $\delta\mathbf{r}$ shown in the figure will be approximately a straight line, so instead of a sector we have a right-angled triangle with $\delta x$, $\delta y$ and $|\delta\mathbf{r}|$ as the sides. $|\delta\mathbf{r}|$ will be the length of the arc, which is $R\,\delta\theta$.

So from Pythagoras' theorem we have $|\delta\mathbf{r}|^{2} = \delta x^{2} + \delta y^{2} = R^{2}\delta\theta^{2}\left(\sin^{2}\theta + \cos^{2}\theta\right) = R^{2}\,\delta\theta^{2}$, i.e. $|\delta\mathbf{r}| = R\,\delta\theta$.

And sure enough this fits in with our “derived” relationship between the generalized and natural coordinates.
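A quick numerical check of this relationship (my own sketch, with an arbitrary radius and angle):

import math

R = 1.0          # circle radius (arbitrary choice for this check)
theta = 0.7      # some angle on the circle
dtheta = 1e-5    # a small "virtual" change in the generalized coordinate

# Exact change in the natural coordinates (x, y) = (R cos(theta), R sin(theta))
dx_exact = R * math.cos(theta + dtheta) - R * math.cos(theta)
dy_exact = R * math.sin(theta + dtheta) - R * math.sin(theta)

# Change predicted by the partial-derivative formula
dx_pred = -R * math.sin(theta) * dtheta
dy_pred = R * math.cos(theta) * dtheta

print(dx_exact, dx_pred)   # agree to ~1e-10
print(dy_exact, dy_pred)
# |dr| should equal R * dtheta for a small dtheta
print(math.hypot(dx_exact, dy_exact), R * dtheta)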

You can also try doing this with the arc length $s$ rather than $\theta$ like how I showed above, where $s$ is the length of the curve traced out by the point mass from some reference point.

I thought of writing it up, but it turns out that the length of a curve $y(x)$ from $a$ to $b$ is $\int_{a}^{b}\sqrt{1 + \left(\frac{dy}{dx}\right)^{2}}\,dx$, which is kinda complicated.

Well as always let me know if you have any thoughts on my post.

Written with StackEdit.

Friday, September 16, 2016

Combinatorics

Combinatorics

We know that we can choose $r$ objects from $n$ distinct objects in $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$ ways. But why?

This can be thought of as having $r$ empty slots in which we want to arrange the objects chosen from the $n$ objects.

For the first slot we have all $n$ objects available to choose from, after which we have $n - 1$ objects for the second slot, and so on until the last slot, for which we only have $n - r + 1$ choices, as $r - 1$ slots have already been filled with objects from the total $n$ objects.

But this count, $n(n-1)\cdots(n-r+1) = \frac{n!}{(n-r)!} = {}^{n}P_{r}$, represents the permutations of $n$ objects taken $r$ at a time, because we have considered the order to be a distinguishing factor.

But when we just want to choose $r$ objects, the order of these objects should not matter.

So we divide ${}^{n}P_{r}$ by $r!$ to get $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$, because $r!$ is the number of ways we can arrange $r$ distinct objects.

Here I had a doubt: why divide by $r!$? How does that work? Division $\frac{a}{b}$ can be thought of as asking: if $a$ was equally distributed into $b$ boxes, how many would there be in each box? How much larger is $a$ when compared to $b$? Or even: how many times do I need to subtract $b$ from $a$ to get $0$?

To understand how this happens, I went back to the first time I heard of division being used in this fashion: when I wanted to find all permutations of $n$ objects of which $r$ were identical, it was said to be $\frac{n!}{r!}$.

In $n!$ we overcounted by a factor of $r!$.

To see why, consider 3 numbered balls of which 2 are yellow (identical except for their numbers). $3!$ is $6$, and the $6$ permutations of these balls are shown below.

Starts with 1

Starts with 2

Starts with 3

Now these are 6 different arrangements if we order with respect to the number on the ball. But what if we don't care about the number, only about the colour? Then we can see that we have counted identical arrangements multiple times.

To put this into perspective we need to group the $3! = 6$ permutations such that in each group the identical objects (yellow balls) occupy the same positions (you can think of it as each of the columns having the same colour).

Group 1 - yellow balls in positions one and two (the first 2 columns): these permutations only count as one colour arrangement.

Group 2 - yellow balls in positions one and three (the first and last columns): these permutations only count as one colour arrangement.

Group 3 - yellow balls in positions two and three (the last 2 columns): these permutations only count as one colour arrangement.

We know, and can see, that each of these groups will have $2! = 2$ permutations in them. This is because within each group we can move the 2 identical objects around in $2!$ ways and yet not change the colour arrangement.

So here, when we divide $3!$ by $2!$, what we are doing is asking how many groups containing $2!$ permutations each exist, such that all the arrangements add up to $3!$.

Note that in each group the positions held by the identical objects must be the same.

Now coming back to $\binom{n}{r}$: when we calculated ${}^{n}P_{r}$, for every distinct set of $r$ objects that were chosen from the $n$ we counted $r!$ extra permutations. So, as order doesn't matter in combinations, we can just divide ${}^{n}P_{r}$ by $r!$ and get

$$\binom{n}{r} = \frac{{}^{n}P_{r}}{r!} = \frac{n!}{r!\,(n-r)!}.$$
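A small enumeration check of this (my own, using Python's itertools):

from itertools import combinations, permutations
from math import comb, perm

n, r = 6, 3
n_perms = len(list(permutations(range(n), r)))   # ordered selections: 6*5*4 = 120
n_combs = len(list(combinations(range(n), r)))   # unordered selections: 20
assert n_perms == perm(n, r)
assert n_combs == comb(n, r) == n_perms // perm(r)   # divide out the r! orderings per subset
print(n_perms, n_combs)                              # 120 20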

Written with StackEdit.

Friday, April 22, 2016

Face Recognition

We have the data set - ORLFACEDATABASE.mat consisting of 400 images (10 images each of 40 people). Each image is of the size 48 x 48 (grey scale). Each image is reshaped to the size of 2304 x 1 (column vector). Size of the MAT file is therefore 2304 x 400

I will be using LDA (as a discriminative technique) followed by nearest mean to form a classifier.

  • First initialize all the necessary variables,
% Number of classes
k = 40;
% number of vectors in ith class (for all i \in [1,.. ,k])
ni = 10;
% number of features - pixels
n = 2304;
  • For each class (person) we can now compute the “mean vector” which is basically the mean image found from the 10 images.
% C is the 2304 x 400 data matrix from ORLFACEDATABASE.mat
% collection of class means - the ith row will be the ith person's mean image
means = zeros(k,n);
for i = 1:1:k
    means(i,:) = ((1/ni)*sum(C(:,(1+(ni*(i-1))):i*ni),2))';
end
  • Here in ‘means’ the ith row is the ith person’s mean image vector.
    We pass 2 as the dimension argument to sum so that it adds along each row (i.e. across the columns); by default (or when passing 1) it adds down each column. It’s better to have each person’s mean vector as a column vector - like how C is arranged - so we transpose:
means = means';
  • Now we can also find the dataset mean - the average human image which I shall call meanStar

% Total mean (all data vectors)
meanStar = (1/400)*sum(C(:,:),2);
  • As discussed for assignment 2, I will employ multi-class LDA and find Sb - the between-class scatter matrix.
% Construct Sb - between class Covariance Matrix
Sb = zeros(n,n);
for i = 1:1:k
    Sb = Sb + ni*(means(:,i) - meanStar)*(means(:,i) - meanStar)';
end
  • Similarly we can construct Sw - the within-class scatter matrix,
% Construct Sw - within class Covariance matrix
Sw = zeros(n,n);
for i = 1:1:k
    % Takes the values of the within class C.M for each of the classes
    Stemp = zeros(n,n);
    % The 10 vectors of the ith class arranged as columns
    classVectors = C(:,(1+(10*(i-1))):i*10);

    for j = 1:1:ni
        Stemp = Stemp + (double(classVectors(:,j)) - means(:,i))*(double(classVectors(:,j)) - means(:,i))';
    end
    Sw = Sw + Stemp;
end

In practice, Sw is often singular, since the data are image vectors with large dimensionality while the size of the data set is much smaller. Running cond on the obtained Sw shows us how ill-conditioned the matrix is.

To alleviate this problem, we can perform two projections:
1) PCA is first applied to the data set to reduce its dimensionality.
2) LDA is then applied to further reduce the dimensionality.

But for now I am using discrete inverse theory: adding a small value c to the diagonal of the matrix A about to be inverted is called damping the inversion, and the small value c is called the Marquardt-Levenberg coefficient. Sometimes matrix A has zero or close-to-zero eigenvalues, due to which the matrix becomes singular; adding a small damping coefficient to the diagonal elements makes it stable. The bigger the value of c, the bigger the damping, and the more stable the matrix inversion.

cond(Sw) % check how ill-conditioned the raw Sw is
Sw = Sw/(10^6) + 2*eye(n,n); % damp Sw (add a multiple of the identity) so it can be inverted
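To see the effect of such damping on a singular matrix, here is a tiny illustration of mine in Python (not part of the assignment):

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # rank-1, hence singular
print(np.linalg.cond(A))            # astronomically large (numerically infinite)

c = 1e-3                            # a small damping (Marquardt-Levenberg) coefficient
A_damped = A + c * np.eye(2)
print(np.linalg.cond(A_damped))     # finite (~5e3): the inversion is now stable
x = np.linalg.solve(A_damped, np.array([1.0, 1.0]))
print(x)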
  • Now that we have got both Sb and Sw, we can find the optimal projection direction w - the leading eigenvector of inv(Sw)*Sb - and use it to project the data to a lower dimension.
% eigenvectors of inv(Sw)*Sb - the LDA directions
[V,D] = eig(Sw\Sb);
% the first eigen vector
w = V(:,1);

% the projected points
y = ones((k*ni),1);

for i = 1:1:(k*ni)
    y(i) = real(w)'*double(C(:,i));
end
  • I just plotted all the points below,

The 400 projected points on a line

Well, I was unable to find the code to plot with 40 different colours, so for now all the points look the same, but each point represents an image. Now using this we can define boundaries for each person and use that to discriminate which person a new image belongs to.

Written with StackEdit.

Tuesday, April 19, 2016

Dimensionality Reduction

Dimensionality reduction is a method of reducing the number of random variables under consideration by obtaining a set of “uncorrelated” principal variables.

If we have $p$ features (each of them being one of the random variables we just talked about), then each data sample will be a $p$-dimensional vector in the feature space. When $p$ is a large number, patterns in the data can be hard to find, as graphical representation is not possible.

What if we were able to identify a set of “principal variables” which are transformations of the existing variables, such that we don’t lose much information?

PCA is a famous example where such data transformations are linear; there are other non-linear methods too.

Here I’ll be talking about PCA (Principal Component Analysis) and K-LDA (the kernelized version of linear discriminant analysis (LDA)), and using these techniques to reduce the 2-dimensional data in this dataset to a single dimension.

Principal Component Analysis

PCA is the orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

Variance is the spread of the data; the more spread out the data is, the better it is for us, as we can more easily separate it into clusters etc. We lack information if all the data vectors are localized around the mean.

So to perform PCA, we calculate the directions (in the $p$-dimensional feature space) in which the variance is maximum. Using these directions as basis vectors we can then project all the data vectors onto this subspace.

Here we choose the directions by calculating the eigenvectors of the covariance matrix: the eigenvectors corresponding to the larger eigenvalues are the directions along which the data varies more.
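Here is a compact Python/numpy sketch of mine of the same recipe (on synthetic data, not the assignment's dataset), before the MATLAB version below:

import numpy as np

rng = np.random.default_rng(0)
# A toy 2-D dataset with most of its variance along one direction (not the TRAINTEST2D data).
X = rng.normal(size=(52, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # centre the data at the origin
C = (Xc.T @ Xc) / (len(Xc) - 1)         # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
w = eigvecs[:, -1]                      # principal direction = largest eigenvalue
scores = Xc @ w                         # the 1-D projection of every sample
print(eigvals, scores[:5])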

Now I will be doing the above method in MATLAB. As part of the assignment, since I am in the ICE department I am assigned the 6th dataset - TRAIN{1,6} - which, as you can see, has 4 clusters.

  • So we can load the data set and initialize the clusters. After which we can plot them as a scatter graph.
load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan


scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;

DataSet initial

  • So we have data comprising a set of $n$ observations of $p$ variables. This data can be arranged as a set of $n$ data vectors, each representing a single grouped observation of the $p$ variables.
    Here 4 clusters with 13 data vectors each contribute to form $n = 52$, with $p = 2$. Write the data vectors as row vectors, each of which has $p$ columns, and place these row vectors into a single matrix X of dimensions $n \times p$ ($52 \times 2$).
  • Now we center our data at the origin. This is supposed to help in reducing the Mean Square Error when we approximate the data, and it also gives some insight into why centering is needed before computing the covariance matrix. Note that after centering at the origin, if you execute sum(B) you will get a very small number close to 0, so we can conclude the data is successfully centered.
    But the plots before and after centering don’t look that different for this dataset, as the empirical mean is small - the entire data set just shifts as a whole fractionally.
% number of data vectors - 13*4
n = 52;
% number of dimensions (before reduction)
p = 2; 
% arrange data vectors as rows
X = [cluster1';cluster2'; cluster3'; cluster4'];
% find both x and y's mean - empirical mean vector
u = (1/n)*sum(X);
% subtract from the mean - Thus achieving centering
B = X - ones(52,1)*u;


% Plot the data after centering it around the Origin
figure(2);
scatter(B(1:13,1), B(1:13,2), 'g'); 
hold on;
scatter(B(14:26,1), B(14:26,2), 'b');
hold on;
scatter(B(27:39,1), B(27:39,2), 'r');
hold on;
scatter(B(40:52,1), B(40:52,2), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

dataset after centering

  • Next I will find the empirical covariance matrix from the outer product of matrix B with itself.
C = (B'*B)/(n-1);
  • Then we can find the eigenvectors of the covariance matrix and choose the one which has the larger eigenvalue (for this dataset that is the second eigenvector returned by eig). Now we can project all the 2D points onto the line in the direction of that particular eigenvector.
% we get a column vector of eigen values - the first will be the one with
% the largest magnitude
E = svds(C);
% V will have the eigen vectors
[V,D] = eig(C);

% Thus the axis along which variance is Maximum is
maxVarianceAxis = V(:,2);

% No need to divide by (maxVarianceAxis'*maxVarianceAxis) as
% eig generates orthonormal eigenvectors
projectionMatrix = (maxVarianceAxis*maxVarianceAxis');

projCluster1 = projectionMatrix*cluster1;
projCluster2 = projectionMatrix*cluster2;
projCluster3 = projectionMatrix*cluster3;
projCluster4 = projectionMatrix*cluster4;

% Plot the data after projecting onto the principle component
figure(3);
scatter(projCluster1(1,:), projCluster1(2,:), 'g'); 
hold on;
scatter(projCluster2(1,:), projCluster2(2,:), 'b');
hold on;
scatter(projCluster3(1,:), projCluster3(2,:), 'r');
hold on;
scatter(projCluster4(1,:), projCluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

After projecting onto the principal component axis

As we can see the separation of blue and green clusters is good, but the light blue cluster is getting mixed up with the red cluster.

Before I wrap up, I’d like to show what the projection of these data vectors onto the other eigenvector looks like,

Projection onto the other eigenvector

See how the red and light blue clusters are well separated here (though the green and the dark blue are not as well separated as in the projection onto the eigenvector corresponding to the largest eigenvalue). The variance captured by that projection is only slightly larger than this one, because the eigenvalues are close - about 0.2 apart.

So here we had 2 features, which we reduced to a single feature (the position of the data vector on the line - a single value if expressed as the distance from the origin). Thus we successfully implemented PCA and extracted an abstract feature along which the data has maximum variance (globally), thereby reducing dimensionality.

This leads us to LDA, where we no longer look at this same dataset as just data vectors; rather, we take into consideration that they are vectors from different classes.

Linear Discriminant Analysis

This is a method to find a linear combination of features that characterizes or separates two or more classes of objects.

This method is also known as Fisher’s linear discriminant. The idea is that we don’t wish to project such that the overall variance is maximized; rather, we want the clusters to be spread out (maximize the between-class covariance after projection) while the data vectors of a particular class stay close to the mean of that class (the within-class covariance over all the classes should be minimized).

The idea is that this enables us to easily discriminate new data vectors into classes. The more clearly separated the classes are, the less ambiguity we face when classifying.

I’ll explain using 2 classes and then extend it to multiple classes (in this dataset we are using 4 classes). Let there be 2 classes, $C_1$ and $C_2$, having $n_1$ and $n_2$ vectors respectively. So the means of the classes will be

$$\mathbf{m}_k = \frac{1}{n_k}\sum_{\mathbf{x} \in C_k} \mathbf{x}, \qquad k = 1, 2.$$

The projection from $p$ dimensions to 1 dimension is done using

$$y = \mathbf{w}^{T}\mathbf{x},$$

where $\mathbf{w}$ and $\mathbf{x}$ are both $p \times 1$ vectors and $y$ is a scalar.

The simplest measure of the separation of the classes, when projected along the line defined by $\mathbf{w}$, is the square of the separation between the projected class means. If the means after projection of classes 1 and 2 are $\tilde{m}_1 = \mathbf{w}^T\mathbf{m}_1$ and $\tilde{m}_2 = \mathbf{w}^T\mathbf{m}_2$, then

$$(\tilde{m}_2 - \tilde{m}_1)^2 = \mathbf{w}^T S_B\, \mathbf{w}.$$

Here we define the between-class covariance matrix as

$$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T.$$

Thus $\mathbf{w}^T S_B\, \mathbf{w}$ is to be maximized.

While doing so, we also minimize the sum of the within-class variances of the two classes, $\tilde{s}_1^2 + \tilde{s}_2^2$.

The within-class variance of the transformed data for the $k^{th}$ class is given by

$$\tilde{s}_k^2 = \sum_{\mathbf{x} \in C_k} \left(\mathbf{w}^T\mathbf{x} - \tilde{m}_k\right)^2.$$

So, after taking $\mathbf{w}$ out of the sigma and rewriting the summation in terms of the total within-class covariance matrix

$$S_W = \sum_{\mathbf{x} \in C_1}(\mathbf{x} - \mathbf{m}_1)(\mathbf{x} - \mathbf{m}_1)^T + \sum_{\mathbf{x} \in C_2}(\mathbf{x} - \mathbf{m}_2)(\mathbf{x} - \mathbf{m}_2)^T,$$

for 2 classes the measure of how close the data vectors are to their respective class means is

$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^T S_W\, \mathbf{w}.$$

Thus $\mathbf{w}^T S_W\, \mathbf{w}$ is to be minimized.

So with this we can construct the utility function which we need to maximize,

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B\, \mathbf{w}}{\mathbf{w}^T S_W\, \mathbf{w}}.$$

Differentiating $J(\mathbf{w})$ with respect to $\mathbf{w}$, setting it equal to zero, and rearranging gives

$$\mathbf{w} \propto S_W^{-1}\,(\mathbf{m}_2 - \mathbf{m}_1).$$
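As a concrete illustration of this two-class result, here is a small Python/numpy sketch of mine on synthetic data (not part of the original assignment):

import numpy as np

rng = np.random.default_rng(1)
# Two synthetic 2-D classes (stand-ins for the real clusters).
c1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(13, 2))
c2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(13, 2))

m1, m2 = c1.mean(axis=0), c2.mean(axis=0)
# Total within-class scatter matrix S_W
Sw = (c1 - m1).T @ (c1 - m1) + (c2 - m2).T @ (c2 - m2)
# Fisher direction: w is proportional to inv(S_W) (m2 - m1)
w = np.linalg.solve(Sw, m2 - m1)
w /= np.linalg.norm(w)

# Projections of the two classes onto w should barely overlap.
print(np.sort(c1 @ w))
print(np.sort(c2 @ w))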

Multi Class LDA

Thus we have found the direction $\mathbf{w}$ for which our conditions are optimally achieved.

But extending this idea to multiple classes requires us to make some assumptions, so as to make the mathematics a little easier.

Suppose that now there are $c$ classes. The within-class covariance matrix which is calculated is

$$S_W = \sum_{k=1}^{c}\ \sum_{\mathbf{x} \in C_k}(\mathbf{x} - \mathbf{m}_k)(\mathbf{x} - \mathbf{m}_k)^T,$$

while the between-class covariance matrix is given by

$$S_B = \sum_{k=1}^{c} n_k\,(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T.$$

In the above 2 equations $\mathbf{m}$ is the mean of all the data vectors and $\mathbf{m}_k$ is the mean of the $n_k$ data vectors in the $k^{th}$ class.

Using the revised covariance matrices and plugging them into the utility function, we will find that the optimal solution for $\mathbf{w}$ is:

the optimal $\mathbf{w}$ is the leading eigenvector of $S_W^{-1}\,S_B$.

Kernel Trick + LDA = K-LDA

K-LDA, the kernelized version of linear discriminant analysis (LDA) (also known as kernel Fisher discriminant analysis or kernel discriminant analysis), is LDA implicitly performed in a new feature space, which allows non-linear mappings to be learned.

For most real-world data a linear discriminant (offered by LDA) is not complex enough, because the data vectors might not be linearly separable at all.

We first map the data non-linearly into some feature space (a higher-dimensional space) and compute LDA there (for different kernel functions these mappings differ). The linear separation which LDA provides in the feature space will yield a non-linear discriminant in the original input space. You can think of 2 clusters (in the 2D plane) with one of the clusters embedded inside the other; now, if we map these vectors to 3 dimensions, we can easily introduce a dummy variable (the third axis) such that the 2 clusters are easily separated linearly (by a plane) - see the small sketch after the image below. As you can see below, the left side image is the input space and the right side image is the feature space.

why use feature space
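Here is a tiny Python sketch of mine of that embedded-clusters picture (not from the post): two concentric rings are not linearly separable in 2-D, but adding x² + y² as a third feature separates them with a plane.

import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]   # cluster embedded inside...
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]   # ...the other cluster

def lift(X):
    """Map (x, y) -> (x, y, x^2 + y^2); the extra 'dummy' feature makes the rings separable."""
    return np.c_[X, (X ** 2).sum(axis=1)]

# In the lifted space the third coordinate alone separates the two clusters:
print(lift(inner)[:, 2].max(), lift(outer)[:, 2].min())   # 0.25 < 4.0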

So, as we are operating in the feature space, the only change to the LDA explained above is that we use $\phi(\mathbf{x})$ instead of $\mathbf{x}$ everywhere. Explicitly computing the mappings $\phi(\mathbf{x})$ and then performing LDA can be computationally expensive, and in many cases intractable. For example, $\phi(\mathbf{x})$ may be infinite-dimensional (as in the case of the Gaussian kernel).

So instead of designing a nonlinear transformation $\phi$ and calculating the transformed data vectors $\phi(\mathbf{x})$ in the feature space, we can rewrite the method in terms of dot products only and use the kernel trick, in which the dot product in the new feature space is replaced by a kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$.

I used this paper and also referred to Wikipedia - we can rewrite the equations for $\mathbf{w}$ in terms of only dot products.

In the following I am assuming these sizes apply (the rest can be understood implicitly): there are $l$ data vectors in total, each with $p$ features, and the projection is onto 1 dimension. The data vectors are separated into $c$ classes, with the $j^{th}$ class having $l_j$ data vectors.

First of all, $\mathbf{w}$ should lie in the span of all the (mapped) training samples (read up on the theory of reproducing kernels to understand why):

$$\mathbf{w} = \sum_{i=1}^{l} \alpha_i\, \phi(\mathbf{x}_i),$$

where $l$ is the total number of vectors.

Now, the dot product of $\mathbf{w}$ with the mean of the $j^{th}$ class in the feature space (the class means in the feature space are defined just like how we had discussed in the LDA section) will give

$$\mathbf{w}^T \mathbf{m}_j^{\phi} \;=\; \frac{1}{l_j}\sum_{i=1}^{l}\sum_{k=1}^{l_j} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}_k^{j}).$$

Using the equation for $\mathbf{w}$ and the kernel function we can therefore say

$$\mathbf{w}^T \mathbf{m}_j^{\phi} = \boldsymbol{\alpha}^T M_j,$$

where $\boldsymbol{\alpha}$ is the vector of all the $\alpha_i$ piled together, and $M_j$ is a vector whose $i^{th}$ element is defined as

$$(M_j)_i = \frac{1}{l_j}\sum_{k=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_k^{j}).$$

Here $\mathbf{x}_i$ is the $i^{th}$ data vector among ALL the vectors, while $\mathbf{x}_k^{j}$ is the $k^{th}$ vector in the $j^{th}$ class.

If we define a matrix M as

$$M = \sum_{j=1}^{c} l_j\,(M_j - M_*)(M_j - M_*)^T,$$

where the $i^{th}$ element of the vector $M_*$ is defined as

$$(M_*)_i = \frac{1}{l}\sum_{k=1}^{l} k(\mathbf{x}_i, \mathbf{x}_k),$$

then the numerator of the utility function can be written as

$$\boldsymbol{\alpha}^T M\, \boldsymbol{\alpha}.$$

Here M plays the role of the between-class covariance matrix defined for the transformed data vectors (in the feature space).

Moving on to the denominator: if we define a matrix N as

$$N = \sum_{j=1}^{c} K_j\,(I - \mathbf{1}_{l_j})\,K_j^T,$$

with the $(i, k)$ component of $K_j$ defined as $k(\mathbf{x}_i, \mathbf{x}_k^{j})$, and $\mathbf{1}_{l_j}$ being the $l_j \times l_j$ matrix with all entries equal to $1/l_j$.

Here $K_j$ (an $l \times l_j$ matrix) is known as the kernel matrix for class $j$.

Then the denominator of the utility function can be written as

$$\boldsymbol{\alpha}^T N\, \boldsymbol{\alpha}.$$

Thus, using both of the highlighted equations, the utility function in the feature space is

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T M\, \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T N\, \boldsymbol{\alpha}}.$$

This problem can be solved (just like how LDA was solved in the input space) and the optimal value for $\boldsymbol{\alpha}$ is the leading eigenvector of $N^{-1}M$.

MATLAB Implementation

The above equations can be implemented in MATLAB as follows. Note there are 2 features and we wish to reduce them to a single dimension. There are 4 classes ($c = 4$), each class having 13 vectors ($l_j = 13$), and so the total number of vectors is $l = 52$.

load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan

nT = 52; % Total number of data Vectors
n1 = 13; % # vectors in cluster 1
n2 = 13; % # vectors in cluster 2
n3 = 13; % # vectors in cluster 3
n4 = 13; % # vectors in cluster 4

% Plot the data before dimensionality reduction
figure(1);
scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');
  • Now I’ll construct the Gram Matrix for all the data vectors,
% CONSTRUCT KERNEL MATRIX
gamma = 10;

% Arrange all the clusters together
X = [cluster1';cluster2'; cluster3'; cluster4'];
K = ones(nT,nT);

% Construct the Gram Matrix 
% K(i,:) - gives the ith vector out of ALL vectors
for i = 1:1:nT
    for j = 1:1:nT
        K(i,j) = kernelGauss(X(i,:),X(j,:),gamma);
    end
end
  • Following the equations given above we can construct M,
% Okay K(i,j) = K(x_i,x_j) Now as X(1:13,:), X(14:26,:), X(27:39,:), 
% X(40:52,:) are the 4 clusters respectively. We can calculate M_i as
% the numbers are like that due to how X was constructed.

M_1 = ones(nT,1);
M_2 = ones(nT,1);
M_3 = ones(nT,1);
M_4 = ones(nT,1);
M_star = ones(nT,1);

% sum(K(1:13,j)) is the sum of k(x,x_j) over all x belonging to cluster 1
% sum(K(14:26,j)) is the sum of k(x,x_j) over all x belonging to cluster 2
% sum(K(27:39,j)) is the sum of k(x,x_j) over all x belonging to cluster 3
% sum(K(40:52,j)) is the sum of k(x,x_j) over all x belonging to cluster 4
% sum(K(:,j)) is the sum of k(x,x_j) over ALL x (all clusters)

for j = 1:1:nT
    M_1(j) = (1/nT)*sum(K(1:13,j));
end
for j = 1:1:nT
    M_2(j) = (1/nT)*sum(K(14:26,j));
end
for j = 1:1:nT
    M_3(j) = (1/nT)*sum(K(27:39,j));
end
for j = 1:1:nT
    M_4(j) = (1/nT)*sum(K(40:52,j));
end

for j = 1:1:nT
    M_star(j) = (1/nT)*sum(K(:,j));
end

% Thus we can construct M
M = n1*(M_1-M_star)*(M_1-M_star)' + n2*(M_2-M_star)*(M_2-M_star)' + n3*(M_3-M_star)*(M_3-M_star)' + n4*(M_4-M_star)*(M_4-M_star)';
  • For constructing N, we will first separate the gram matrix we had made into the kernel matrices for each of the classes.
% Now we shall construct the kernel matrices for each of the clusters
K_1 = K(:,1:13);
K_2 = K(:,14:26);
K_3 = K(:,27:39);
K_4 = K(:,40:52);

% Thus we can construct N
N = K_1*(eye(n1,n1)-(1/n1)*ones(n1,n1))*K_1' + K_2*(eye(n2,n2)-(1/n2)*ones(n2,n2))*K_2' + K_3*(eye(n3,n3)-(1/n3)*ones(n3,n3))*K_3' + K_4*(eye(n4,n4)-(1/n4)*ones(n4,n4))*K_4';
  • Now compute $N^{-1}M$ and its eigenvalues and eigenvectors. Note that, as N is usually singular, we add a multiple of the identity matrix to it so that inv(N) becomes computable.
% Note that in practice, \mathbf{N} is usually singular and 
% so a multiple of the identity is added to it
N = N + 2*eye(nT,nT);
cond(N) % is N ill conditioned? 

% Now that we have both N and M, we can find all the eigenvectors
% and choose most prominent eigenvectors
P = inv(N)*M;
[V,D] = eig(P);
  • Now we can project the data onto a line using the leading eigenvector, which has an eigenvalue of 3.07.
% the first eigenvector has the largest corresponding eigenvalue - 3.07
% after which 0.4, 0.28, 0.26 Then it drops down...
alpha = V(:,1);

% the projected points
y = ones(nT,1);

for i = 1:1:nT
    y(i) = alpha'*K(:,i);
end

projCluster1 = y(1:13);
projCluster2 = y(14:26);
projCluster3 = y(27:39);
projCluster4 = y(40:52);

% Plot the projected data
figure(3);
scatter(projCluster1, zeros(13,1), 'g'); 
hold on;
scatter(projCluster2, zeros(13,1), 'b');
hold on;
scatter(projCluster3, zeros(13,1), 'r');
hold on;
scatter(projCluster4, zeros(13,1), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

The plot we get for gamma (of the kernel function) being 10 is,

After projection

If we try different values of gamma for the kernel function, we can surely get even better separation of the classes.

So in conclusion, both methods can be viewed as techniques for linear dimensionality reduction. However, PCA is unsupervised and depends only on the data vectors (maximizing their variance using fewer features), whereas the Fisher linear discriminant (K-LDA) also uses class-label information to bring data vectors closer to other data vectors in the same class while maximizing the variance between classes.

Written with StackEdit.