Wednesday, November 30, 2016

Interesting Questions

Question 1

Divide a circle into 6 sectors and write the numbers 1, 0, 1, 0, 0, 0 in them, one number per sector. A move consists of incrementing two neighbouring numbers by 1. Can we make all the numbers in the circle equal by a sequence of these moves?

I tried brute force and as intuition seems to suggest, this isn’t possible.

import operator

def spl_add(a, b):
    """Element-wise (sector-wise) addition of two tuples."""
    return tuple(map(operator.add, a, b))

# The six possible moves: add 1 to each pair of adjacent sectors (the circle wraps around).
MOVES = [
    (1, 1, 0, 0, 0, 0),
    (0, 1, 1, 0, 0, 0),
    (0, 0, 1, 1, 0, 0),
    (0, 0, 0, 1, 1, 0),
    (0, 0, 0, 0, 1, 1),
    (1, 0, 0, 0, 0, 1),
]

def move(n):
    # Bound the brute-force search: each move adds 2 to the total.
    if sum(n) > 10:
        return
    if n.count(n[0]) == len(n):
        print("Hooray", n)
        return
    for m in MOVES:
        move(spl_add(n, m))

start = (1, 0, 1, 0, 0, 0)
move(start)

SOLUTION? Why no? One way to see it: each move adds 1 to one odd-numbered sector and 1 to one even-numbered sector, so the difference between the sum of the odd sectors and the sum of the even sectors (which starts at $2 - 0 = 2$) never changes; making all six numbers equal would require that difference to be 0.

Question 2

Of all the sets of positive integers whose sum is 1000, which set has the maximum product?

Solution 2

Consider the numbers 2 and 3: does 2 or 3 contribute more to the size of the product? This is what I was first faced with. I could take 500 2's and their sum would be 1000. Here the product would be $2^{500}$. If you thought 2 contributed more you would go with this.

Say you thought 3 was more important; in this case 333 3's would sum to 999. Note the product here is $3^{333}$.

Even though we humans are naturally bad at grasping enormous numbers, we can compare these two: $2^{500} = 4\cdot(2^{3})^{166} = 4\cdot 8^{166}$, which is clearly less than $3\cdot(3^{2})^{166} = 3\cdot 9^{166} = 3^{333}$, since $8 < 9$ and the factor $(9/8)^{166}$ dwarfs the leftover $4/3$.

Ok, as some of us might have guessed, it's better to multiply many copies of a smaller number than fewer copies of a larger number.

Multiplication is repeated Addition, while Power is repeated Multiplication.

But $3^{333}$ is still not the answer. Let us divide the sum $S$ (a constant, here 1000) evenly into pieces of size $x$, so that there are $S/x$ of them. So we wish to maximize the product

$$P(x) = x^{S/x}.$$

Note how $\ln P(x) = \frac{S}{x}\ln x$, and maximizing $\ln P$ maximizes $P$.

To maximize it, we differentiate and equate that to 0:

$$\frac{d}{dx}\left(\frac{S}{x}\ln x\right) = \frac{S\,(1 - \ln x)}{x^{2}} = 0 \;\implies\; \ln x = 1 \;\implies\; x = e.$$

So you can see how, no matter what the sum is (it doesn't have to be 1000), the product is maximized when each piece equals $e \approx 2.718$. But we can't choose $e$ as it's not an integer, so we choose the nearest integer to $e$, i.e. 3.

So the ideal answer should be full of 3's. But if we take 333 3's then we would have to take 1 as the last integer, and multiplying by 1 does nothing for the product. So 332 3's and two 2's is optimal, as $3\times 332 + 2\times 2 = 1000$ and $2^{2}\cdot 3^{332} > 1\cdot 3^{333}$.

In the case that the sum leaves a remainder of 2 when divided by 3, we should instead take $\lfloor S/3 \rfloor$ 3's and a single 2.
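As a quick sanity check of the 3's-and-2's rule, here is a small brute-force sketch (my own, not from the original post) that computes the maximum product over all ways of splitting a small sum:

from functools import lru_cache

@lru_cache(maxsize=None)
def max_product(s):
    """Maximum product over all ways of writing s as a sum of positive integers
    (including leaving s unsplit)."""
    best = s  # the trivial "split" consisting of s itself
    for first in range(1, s):
        best = max(best, first * max_product(s - first))
    return best

# Matches the rule: use 3's, patching a remainder of 1 with two 2's.
print(max_product(9), max_product(10))   # 27 = 3*3*3, 36 = 3*3*2*2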

Question 3

A dragon has $N$ heads. A soldier can cut off 24 heads, but then 15 will grow back; he can cut off 17 heads, but then 2 will grow back. He can also cut off 20 heads, but then 11 will grow back. Will the dragon ever die (reach exactly 0 heads) if the soldier tries various combinations of these moves? If so, how?

Solution 3

Move 1 causes a net change of $-24 + 15 = -9$ heads.
Move 2 causes a net change of $-17 + 2 = -15$ heads.
Move 3 causes a net change of $-20 + 11 = -9$ heads.

Say we do $a$ of move 1, $b$ of move 2, and $c$ of move 3. The net change is

$$-9a - 15b - 9c.$$

For the dragon to die this must equal $-N$, so we want to find integer values of $a, b, c$ such that $9a + 15b + 9c = N$. Solving for it, divide the whole equation by 3:

$$3a + 5b + 3c = \frac{N}{3}.$$

The left-hand side, $3a + 5b + 3c$, will always be an integer, say $k$.

But the number of heads given in the question is not a multiple of 3, so $N/3$ is an integer plus a proper fraction, and a fraction plus an integer can never equal the integer $k$. So we can be sure there exist no integers $a, b, c$ such that the above equation is satisfied.

The dragon can never be killed
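To make the divisibility argument concrete, here is a tiny Python check of mine (not from the post). The head counts 100 and 102 are purely illustrative, since the question's actual number isn't reproduced above; the check only tests the necessary mod-3 condition.

from math import gcd
from functools import reduce

# Net head reductions per move: cut 24 grow 15, cut 17 grow 2, cut 20 grow 11.
nets = [24 - 15, 17 - 2, 20 - 11]   # [9, 15, 9]
g = reduce(gcd, nets)               # gcd(9, 15, 9) = 3

def can_possibly_die(heads):
    """Necessary condition for 9a + 15b + 9c = heads to have integer solutions:
    heads must be a multiple of gcd(9, 15, 9) = 3."""
    return heads % g == 0

print(can_possibly_die(100))   # False - 100 is not a multiple of 3 (illustrative value)
print(can_possibly_die(102))   # True  - the divisibility obstruction alone doesn't rule it out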

Question 4

What is the remainder when you divide a number $N$ by 999, where $N$ is formed by writing a certain block of digits 16 times in a row?

Solution 4

The first key insight is to remember how the divisibility rule for 9 (which happens to also be used for 3) was derived. If a decimal number $N$ has the digits $d_k d_{k-1}\ldots d_1 d_0$, then

$$N = \sum_{i=0}^{k} d_i\,10^{i} = \sum_{i=0}^{k} d_i\,(10^{i} - 1) + \sum_{i=0}^{k} d_i.$$

We know that $10^{i} - 1$ (a string of nines) is always divisible by 9. So if the digit sum $\sum_i d_i$ is divisible by 9, the whole number is divisible by 9.

Coming to this problem, we need to divide by 999, which divides powers of 10 in a peculiar pattern:

$$10^{1} \equiv 10,\quad 10^{2} \equiv 100,\quad 10^{3} \equiv 1,\quad 10^{4} \equiv 10,\quad 10^{5} \equiv 100,\quad 10^{6} \equiv 1 \pmod{999},\ \ldots$$

So 999 leaves a remainder of 1 only periodically with the powers of 10 (every third power), unlike 9, which was nice enough to leave 1 as the remainder for every power of 10 - that is what had helped us isolate just the digits of $N$.

So our $N$ in terms of its digits would be

$$N = \sum_{i} d_i\,10^{i}.$$

We can't pull a multiple of 999 out of every term individually (only the powers $10^{0}, 10^{3}, 10^{6}, \ldots$ reduce to 1), so let's group the digits based on which power of 10 accompanies them: modulo 999, each power of 10 is congruent to 1, 10 or 100, repeating with period 3. Equivalently, if we cut the decimal representation of $N$ into blocks of three digits starting from the right, each block contributes just its own value, so

$$N \equiv \big(\text{sum of its three-digit blocks}\big) \pmod{999}.$$

Here we needed to find out how many such blocks there are, i.e. how many of the powers $10^{0}, 10^{3}, 10^{6}, \ldots$ occur in $N$. One way is to calculate $n$ from the AP equation $a_n = a + (n-1)d$ with common difference $d = 3$.

Adding the blocks up and reducing the (now small) sum modulo 999 can be done by hand, and we get the answer 786.

The remainder is 786
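Here is a short Python sketch of mine (not from the post) of the block-summing trick; the number "2016" written 16 times is only an illustrative stand-in, since the post's actual number isn't reproduced above.

def mod_999_by_blocks(digits):
    """Reduce a huge decimal number modulo 999 by summing its 3-digit blocks
    (taken from the right), using the fact that 10**3 == 1 (mod 999)."""
    total = 0
    while digits:
        total += int(digits[-3:])   # the last (up to) three digits
        digits = digits[:-3]
    return total % 999

n = "2016" * 16                     # an illustrative 64-digit number
assert mod_999_by_blocks(n) == int(n) % 999
print(mod_999_by_blocks(n))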

Question 5

There is a set of numbers $\{x_1, x_2, \ldots, x_n\}$. We proceed to repeatedly select (at random) two numbers ($a$ and $b$) and compute the number $a + b + ab$; we remove both $a$ and $b$ while we add $a + b + ab$ to the set. What will the final remaining number be?

Solution 5

Note that

$$a + b + ab = (1 + a)(1 + b) - 1.$$

So now the operation becomes "selecting (at random) two numbers ($a$ and $b$), computing the number $(1+a)(1+b) - 1$, removing both $a$ and $b$ while adding this new number to the set".

So for $\{x_1, x_2, x_3\}$, after the first operation (say on $x_1$ and $x_2$) it would become $\{(1+x_1)(1+x_2) - 1,\; x_3\}$, and applying the operation once more gives $\{(1+x_1)(1+x_2)(1+x_3) - 1\}$, since adding 1 to the combined element gives back the product $(1+x_1)(1+x_2)$.

So you see where this is going: each element simply contributes a factor of one plus itself, no matter in which order the pairs are picked.

The final term will be

$$(1 + x_1)(1 + x_2)\cdots(1 + x_n) - 1.$$

So for the set given in the question, the final remaining number is the product of one plus each of its elements, minus one.
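A quick simulation (my own sketch, on an arbitrary example set rather than the one in the question) confirms that the final number doesn't depend on the order in which pairs are picked:

import random
from math import prod

def collapse(values):
    """Repeatedly replace two randomly chosen elements a, b with a + b + a*b."""
    values = list(values)
    while len(values) > 1:
        a = values.pop(random.randrange(len(values)))
        b = values.pop(random.randrange(len(values)))
        values.append(a + b + a * b)
    return values[0]

nums = [1, 2, 3, 4, 5]                       # an arbitrary example set
expected = prod(1 + x for x in nums) - 1     # (2*3*4*5*6) - 1 = 719
assert all(collapse(nums) == expected for _ in range(100))
print(expected)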

Question 8

Find $n$, where $n + d(n) + d(d(n)) = 1000$. Here $d(n)$ is the digital sum of $n$.

The digital sum is just the digit sum of $n$. I was baffled by this term, for it was the first time I had heard it.

Solution 8

We know that $n + d(n) + d(d(n)) = 1000$. So, as all terms are positive, $n$ can be at most a 3-digit number; any larger than that and the sum will overshoot 1000.

It can't be a 2-digit number either, as even if $n$ is the largest 2-digit number, $99 + d(99) + d(d(99)) = 99 + 18 + 9 = 126$, far short of 1000.

So, expressing $n$ using its digits, $n = 100a + 10b + c$ (with $a \ge 1$), and $d(n) = a + b + c$.

We can see that the sum of 3 digits cannot exceed 27, a 2-digit number, so let

$$d(n) = a + b + c = 10p + q, \qquad\text{so that } d(d(n)) = p + q.$$

So the given equation will be

$$\underbrace{100a + 10b + c}_{n} + \underbrace{(a + b + c)}_{d(n)} + \underbrace{(p + q)}_{d(d(n))} = 1000,$$

i.e.

$$101a + 11b + 2c + p + q = 1000.$$

Now we can be sure $a \le 8$ is not possible, as then $n \le 899$, $d(n) \le 26$ and $d(d(n)) \le 10$, so the sum cannot exceed $935 < 1000$. So substituting $a = 9$,

$$909 + 11b + 2c + p + q = 1000 \;\implies\; 11b + 2c + p + q = 91, \qquad 9 + b + c = 10p + q.$$

Substituting $q = 9 + b + c - 10p$ into the first equation gives

$$12b + 3c - 9p = 82.$$

The left-hand side is always divisible by 3, while 82 is not. So we can be sure there exist no digits $a, b, c$ (and hence no integer $n$) such that the above equation is satisfied.

I wrote some code in Python to try and brute-force the solution:

# To find n such that n + d(n) + d(d(n)) = 1000
def d(i):
    """Digit sum of i."""
    return sum(int(ch) for ch in str(i))

# n must be a 3-digit number starting with 9 (see the argument above)
solutions = [n for n in range(900, 1000) if n + d(n) + d(d(n)) == 1000]
print(solutions)   # prints [] - no such n exists

Final Answer:

No such $n$ exists.


1) 1, 0, 1, 0, 0, 0 in a circle (6 sectors). Add one to two neighbouring numbers; can we get all numbers to be equal?
NO, but why…?

2) Sum of a set of integers is 1000, what’s the maximum possible product of these integers?

3) Will the dragon die? Moves: (−24, +15), (−17, +2), (−20, +11), i.e. net changes of −9, −15, −9.
No

4) Smallest multiple of 9 with all even digits?

5) couples in a village, keep copulating until a boy is born, how does the boy:girl ratio change?
1:1

6) (a block of digits repeated 16 times) % 999 = k (what's k?)
786

7) where
21

8) n + d(n) + d(d(n)) = 1000
No n exists that satisfies this equation.

9) What’s the output of this program?

def f(n):
    if(n>0):
        print("hello")
        print("world")

“hello” followed by “world”

10) Find the 100 largest numbers in a set of numbers?
heap ds

11) Find a set of 20 positive integers such that for all (maybe any?) subset of that set the sum of the elements in that subset is a perfect number?
??

12) There is a set of numbers; remove 2 numbers a and b at random from this set, then add the number a + b + ab to the set. Repeating this, what number will finally be remaining?

13) How many 36 digit numbers exist which are not divisible by 100?

Written with StackEdit.

Sunday, October 23, 2016

Generalized coordinates

In this post I’ll try to explain what generalized coordinates are, how they came to be, and finally where they are used, with a simple example.

To start with, imagine a system of $N$ point masses present in a $d$-dimensional space. The $i^{th}$ point has a position vector $\mathbf{r}_i$.

Note that if the system were unconstrained, then as any of the bodies can occupy any point in the space, we would need $dN$ scalars to completely define any particular configuration of the system.

For visualization purposes consider $d = 3$, so we use 3 scalars (the $x, y, z$ coordinates) to define the position of each point mass. As we have a total of $N$ such masses we need $3N$ scalars.

But often the systems we need to analyse are under some sort of constraint. In the case of a robotic arm, adjacent hinge points always maintain a fixed distance from each other due to the rigid rod connecting them.

In such situations we don’t need all $3N$ scalars; rather, we can use fewer scalars and still manage to uniquely represent all possible states that this constrained system can take. This is because the constraints decrease the degrees of freedom available to the system.

Holonomic constraints

We classify constraints as either being holonomic or nonholonomic.

If a constraint can be expressed purely as a relationship between the position variables ($\mathbf{r}_i$) and time $t$, i.e. in the form

$$f(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N, t) = 0,$$

it is a holonomic constraint.

Then using each such constraint equation we can reduce the number of scalars needed by one. If we knew the time and all but one position variable, using the constraint equation that missing coordinate can be deduced.

When the constraint has inequalities or higher-order derivative terms in it, it is classified as nonholonomic. Note that we have an equality symbol in the above expression. Also, the function has to be of the position variables themselves, not their derivatives.

All constraints which are unable to be expressed in the above form are called nonholonomic constraints.

Note that if there were velocity terms ($\dot{\mathbf{r}}_i$) in the constraint equation, then integrating that constraint could in some cases yield an equation of the form shown above; if so, then it’s a holonomic constraint. So what Wikipedia means when they say that constraints with velocity terms “are not usually holonomic” is that it’s kinda hard to bring them to this form in that case.

When constraints depend purely on the positions at any given time, it's a holonomic constraint; but in a nonholonomic constraint, the constraint changes depending on how you reach that position! This seems to be analogous to path and state functions.

To visualise how a holonomic constraint reduces the DOF, imagine a point mass constrained to move along the line $y = mx + c$. Without this constraint we needed both position variables ($x$ and $y$) to describe its position anywhere in the XY plane, but now we just need $x$: if we have $x$, its position is known to be $(x, mx + c)$ due to the constraint.

Constrained to move in a circle of radius R

So a point mass is moving around in a plane (say the XY plane). We can describe its position at any point of time using the vector

$$\mathbf{r} = x\,\hat{i} + y\,\hat{j}.$$

If we are given that this point mass is constrained to move in a circle of radius $R$, the constraint equation would look like this:

$$x^{2} + y^{2} = R^{2}.$$

Note how the equation can be written as $x^{2} + y^{2} - R^{2} = 0$, i.e. in the form $f(x, y) = 0$. Therefore we conclude that this is a holonomic constraint.

So the “generalized coordinate” I will use is $\theta$, the angle between the $x$ axis and the position vector $\mathbf{r}$.

circle

We know that $x = R\cos\theta$ and $y = R\sin\theta$.

Let the point mass move a little; it can of course only move in a small arc along the circle due to the constraint imposed on it. Such a displacement is called a virtual displacement. In terms of $(x, y)$ it's $(\delta x, \delta y)$, while in terms of the generalized coordinate it's $\delta\theta$.

The following formula relates the change in the generalized coordinates ($\delta q_j$) to the change in the natural coordinates ($\delta\mathbf{r}$):

$$\delta\mathbf{r} = \sum_{j=1}^{n} \frac{\partial\mathbf{r}}{\partial q_j}\,\delta q_j.$$

The term $\frac{\partial\mathbf{r}}{\partial q_j}$ is a measure of how much change in $\mathbf{r}$ happens for a unit change in $q_j$, keeping all other generalized coordinates constant - that’s what partial differentiation is.

So by multiplying with $\delta q_j$, you get the contribution of the $j^{th}$ generalized coordinate to the displacement $\delta\mathbf{r}$.

Summing over all these contributions (from $j = 1$ to $n$) we get the net virtual displacement $\delta\mathbf{r}$.

Coming back to our example, we can therefore note that

$$\delta x = \frac{\partial x}{\partial\theta}\,\delta\theta = -R\sin\theta\,\delta\theta, \qquad \delta y = \frac{\partial y}{\partial\theta}\,\delta\theta = R\cos\theta\,\delta\theta.$$

Credits to Maschen for this image, CC0.
Virtual displacement of the point mass on the circle

This makes sense: for a small displacement, the arc $\delta\mathbf{r}$ shown in the figure will be approximately a straight line, so instead of a sector we have a right-angled triangle with $\delta x$, $\delta y$ and $|\delta\mathbf{r}|$ as the sides. $|\delta\mathbf{r}|$ will be the length of the arc, which is $R\,\delta\theta$.

So from Pythagoras' theorem we have $|\delta\mathbf{r}|^{2} = \delta x^{2} + \delta y^{2} = R^{2}\delta\theta^{2}\left(\sin^{2}\theta + \cos^{2}\theta\right) = R^{2}\,\delta\theta^{2}$, i.e. $|\delta\mathbf{r}| = R\,\delta\theta$.

And sure enough this fits in with our “derived” relationship between the generalized and natural coordinates.
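A quick numerical check of this relationship (my own sketch, with an arbitrary radius and angle):

import math

R = 1.0          # circle radius (arbitrary choice for this check)
theta = 0.7      # some angle on the circle
dtheta = 1e-5    # a small "virtual" change in the generalized coordinate

# Exact change in the natural coordinates (x, y) = (R cos(theta), R sin(theta))
dx_exact = R * math.cos(theta + dtheta) - R * math.cos(theta)
dy_exact = R * math.sin(theta + dtheta) - R * math.sin(theta)

# Change predicted by the partial-derivative formula
dx_pred = -R * math.sin(theta) * dtheta
dy_pred = R * math.cos(theta) * dtheta

print(dx_exact, dx_pred)   # agree to ~1e-10
print(dy_exact, dy_pred)
# |dr| should equal R * dtheta for a small dtheta
print(math.hypot(dx_exact, dy_exact), R * dtheta)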

You can also try doing this with the arc length $s$ rather than $\theta$ like how I showed above, where $s$ is the length of the curve traced out by the point mass from some reference point.

I thought of writing it up, but it turns out that the length of a curve $y(x)$ from $a$ to $b$ is $\int_{a}^{b}\sqrt{1 + \left(\frac{dy}{dx}\right)^{2}}\,dx$, which is kinda complicated.

Well as always let me know if you have any thoughts on my post.

Written with StackEdit.

Friday, September 16, 2016

Combinatorics

Combinatorics

We know that we can choose $r$ objects from $n$ distinct objects in $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$ ways. But why?

This can be thought of as having $r$ empty slots in which we want to arrange the objects chosen from the $n$ objects.

For the first slot we have all $n$ objects available to choose from, after which we have $n - 1$ objects for the second slot, and so on until the last slot, for which we only have $n - r + 1$ choices, as $r - 1$ slots have already been filled with objects from the total $n$ objects.

But this count, $n(n-1)\cdots(n-r+1) = \frac{n!}{(n-r)!} = {}^{n}P_{r}$, represents the permutations of $n$ objects taken $r$ at a time, because we have considered the order to be a distinguishing factor.

But when we just want to choose $r$ objects, the order of these objects should not matter.

So we divide ${}^{n}P_{r}$ by $r!$ to get $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$, because $r!$ is the number of ways we can arrange $r$ distinct objects.

Here I had a doubt: why divide by $r!$? How does that work? Division $\frac{a}{b}$ can be thought of as asking: if $a$ was equally distributed into $b$ boxes, how many would there be in each box? How much larger is $a$ when compared to $b$? Or even: how many times do I need to subtract $b$ from $a$ to get $0$?

To understand how this happens, I went back to the first time I heard of division being used in this fashion: when I wanted to find all permutations of $n$ objects of which $r$ were identical, it was said to be $\frac{n!}{r!}$.

In $n!$ we overcounted by a factor of $r!$.

To see why, consider 3 numbered balls of which 2 are yellow (identical except for their numbers). $3!$ is $6$, and the $6$ permutations of these balls are shown below.

Starts with 1

Starts with 2

Starts with 3

Now these are 6 different arrangements if we order with respect to the number on the ball. But what if we don't care about the number, only about the colour? Then we can see that we have counted identical arrangements multiple times.

To put this into perspective we need to group the $3! = 6$ permutations such that in each group the identical objects (yellow balls) occupy the same positions (you can think of it as each of the columns having the same colour).

Group 1 - yellow balls in positions one and two (the first 2 columns): these permutations only count as one colour arrangement.

Group 2 - yellow balls in positions one and three (the first and last columns): these permutations only count as one colour arrangement.

Group 3 - yellow balls in positions two and three (the last 2 columns): these permutations only count as one colour arrangement.

We know, and can see, that each of these groups will have $2! = 2$ permutations in them. This is because within each group we can move the 2 identical objects around in $2!$ ways and yet not change the colour arrangement.

So here, when we divide $3!$ by $2!$, what we are doing is asking how many groups containing $2!$ permutations each exist, such that all the arrangements add up to $3!$.

Note that in each group the positions held by the identical objects must be the same.

Now coming back to $\binom{n}{r}$: when we calculated ${}^{n}P_{r}$, for every distinct set of $r$ objects that were chosen from the $n$ we counted $r!$ extra permutations. So, as order doesn't matter in combinations, we can just divide ${}^{n}P_{r}$ by $r!$ and get

$$\binom{n}{r} = \frac{{}^{n}P_{r}}{r!} = \frac{n!}{r!\,(n-r)!}.$$
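A small enumeration check of this (my own, using Python's itertools):

from itertools import combinations, permutations
from math import comb, perm

n, r = 6, 3
n_perms = len(list(permutations(range(n), r)))   # ordered selections: 6*5*4 = 120
n_combs = len(list(combinations(range(n), r)))   # unordered selections: 20
assert n_perms == perm(n, r)
assert n_combs == comb(n, r) == n_perms // perm(r)   # divide out the r! orderings per subset
print(n_perms, n_combs)                              # 120 20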

Written with StackEdit.

Friday, April 22, 2016

Face Recognition

We have the data set - ORLFACEDATABASE.mat consisting of 400 images (10 images each of 40 people). Each image is of the size 48 x 48 (grey scale). Each image is reshaped to the size of 2304 x 1 (column vector). Size of the MAT file is therefore 2304 x 400

I will be using LDA (as a discriminative technique) followed by nearest mean to form a classifier.

  • First initialize all the necessary variables,
% Number of classes
k = 40;
% number of vectors in ith class (for all i \in [1,.. ,k])
ni = 10;
% number of features - pixels
n = 2304;
  • For each class (person) we can now compute the “mean vector” which is basically the mean image found from the 10 images.
% C is the 2304 x 400 data matrix from ORLFACEDATABASE.mat
% collection of class means - the ith row will be the ith person's mean image
means = zeros(k,n);
for i = 1:1:k
    means(i,:) = ((1/ni)*sum(C(:,(1+(ni*(i-1))):i*ni),2))';
end
  • Here in ‘means’ the ith row is the ith person’s mean image vector.
    We pass 2 as the dimension argument to sum so that it adds along each row (i.e. across the columns); by default (or when passing 1) it adds down each column. It’s better to have each person’s mean vector as a column vector - like how C is arranged - so we transpose:
means = means';
  • Now we can also find the dataset mean - the average human image which I shall call meanStar

% Total mean (all data vectors)
meanStar = (1/400)*sum(C(:,:),2);
  • As discussed for assignment 2, I will employ multi-class LDA and find Sb - the between-class scatter matrix.
% Construct Sb - between class Covariance Matrix
Sb = zeros(n,n);
for i = 1:1:k
    Sb = Sb + ni*(means(:,i) - meanStar)*(means(:,i) - meanStar)';
end
  • Similarly we can construct Sw - the within-class scatter matrix,
% Construct Sw - within class Covariance matrix
Sw = zeros(n,n);
for i = 1:1:k
    % Takes the values of the within class C.M for each of the classes
    Stemp = zeros(n,n);
    % The 10 vectors of the ith class arranged as columns
    classVectors = C(:,(1+(10*(i-1))):i*10);

    for j = 1:1:ni
        Stemp = Stemp + (double(classVectors(:,j)) - means(:,i))*(double(classVectors(:,j)) - means(:,i))';
    end
    Sw = Sw + Stemp;
end

In practice, Sw is often singular, since the data are image vectors with large dimensionality while the size of the data set is much smaller. Running cond on the obtained Sw shows us how ill-conditioned the matrix is.

To alleviate this problem, we can perform two projections:
1) PCA is first applied to the data set to reduce its dimensionality.
2) LDA is then applied to further reduce the dimensionality.

But for now I am using discrete inverse theory: adding a small value c to the diagonal of the matrix A about to be inverted is called damping the inversion, and the small value c is called the Marquardt-Levenberg coefficient. Sometimes matrix A has zero or close-to-zero eigenvalues, due to which the matrix becomes singular; adding a small damping coefficient to the diagonal elements makes it stable. The bigger the value of c, the bigger the damping, and the more stable the matrix inversion.

cond(Sw) % check how ill-conditioned the raw Sw is
Sw = Sw/(10^6) + 2*eye(n,n); % damp Sw (add a multiple of the identity) so it can be inverted
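To see the effect of such damping on a singular matrix, here is a tiny illustration of mine in Python (not part of the assignment):

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # rank-1, hence singular
print(np.linalg.cond(A))            # astronomically large (numerically infinite)

c = 1e-3                            # a small damping (Marquardt-Levenberg) coefficient
A_damped = A + c * np.eye(2)
print(np.linalg.cond(A_damped))     # finite (~5e3): the inversion is now stable
x = np.linalg.solve(A_damped, np.array([1.0, 1.0]))
print(x)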
  • Now that we have got both Sb and Sw, we can find the optimal projection direction w - the leading eigenvector of inv(Sw)*Sb - and use it to project the data to a lower dimension.
% eigenvectors of inv(Sw)*Sb - the LDA directions
[V,D] = eig(Sw\Sb);
% the first eigen vector
w = V(:,1);

% the projected points
y = ones((k*ni),1);

for i = 1:1:(k*ni)
    y(i) = real(w)'*double(C(:,i));
end
  • I just plotted all the points below,

The 400 projected points on a line

Well, I was unable to find the code to plot with 40 different colours, so for now all the points look the same, but each point represents an image. Now using this we can define boundaries for each person and use that to discriminate which person a new image belongs to.

Written with StackEdit.

Tuesday, April 19, 2016

Dimensionality Reduction

Dimensionality reduction is a method of reducing the number of random variables under consideration by obtaining a set of “uncorrelated” principal variables.

If we have $p$ features (each of them being one of the random variables we just talked about), then each data sample will be a $p$-dimensional vector in the feature space. When $p$ is a large number, patterns in the data can be hard to find, as graphical representation is not possible.

What if we were able to identify a set of “principal variables” which are transformations of the existing variables, such that we don’t lose much information?

PCA is a famous example where such data transformations are linear; there are other non-linear methods too.

Here I’ll be talking about PCA (Principal Component Analysis) and K-LDA (the kernelized version of linear discriminant analysis (LDA)), and using these techniques to reduce the 2-dimensional data in this dataset to a single dimension.

Principal Component Analysis

PCA is the orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

Variance is the spread of the data; the more spread out the data is, the better it is for us, as we can more easily separate it into clusters etc. We lack information if all the data vectors are localized around the mean.

So to perform PCA, we calculate the directions (in the $p$-dimensional feature space) in which the variance is maximum. Using these directions as basis vectors we can then project all the data vectors onto this subspace.

Here we choose the directions by calculating the eigenvectors of the covariance matrix: the eigenvectors corresponding to the larger eigenvalues are the directions along which the data varies more.
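Here is a compact Python/numpy sketch of mine of the same recipe (on synthetic data, not the assignment's dataset), before the MATLAB version below:

import numpy as np

rng = np.random.default_rng(0)
# A toy 2-D dataset with most of its variance along one direction (not the TRAINTEST2D data).
X = rng.normal(size=(52, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # centre the data at the origin
C = (Xc.T @ Xc) / (len(Xc) - 1)         # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
w = eigvecs[:, -1]                      # principal direction = largest eigenvalue
scores = Xc @ w                         # the 1-D projection of every sample
print(eigvals, scores[:5])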

Now I will be doing the above method in MATLAB. As part of the assignment, since I am in the ICE department I am assigned the 6th dataset - TRAIN{1,6} - which, as you can see, has 4 clusters.

  • So we can load the data set and initialize the clusters. After which we can plot them as a scatter graph.
load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan


scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;

DataSet initial

  • So we have data comprising a set of $n$ observations of $p$ variables. This data can be arranged as a set of $n$ data vectors, each representing a single grouped observation of the $p$ variables.
    Here 4 clusters with 13 data vectors each contribute to form $n = 52$, with $p = 2$. Write the data vectors as row vectors, each of which has $p$ columns, and place these row vectors into a single matrix X of dimensions $n \times p$ ($52 \times 2$).
  • Now we center our data at the origin. This is supposed to help in reducing the Mean Square Error when we approximate the data, and it also gives some insight into why centering is needed before computing the covariance matrix. Note that after centering at the origin, if you execute sum(B) you will get a very small number close to 0, so we can conclude the data is successfully centered.
    But the plots before and after centering don’t look that different for this dataset, as the empirical mean is small - the entire data set just shifts as a whole fractionally.
% number of data vectors - 13*4
n = 52;
% number of dimensions (before reduction)
p = 2; 
% arrange data vectors as rows
X = [cluster1';cluster2'; cluster3'; cluster4'];
% find both x and y's mean - empirical mean vector
u = (1/n)*sum(X);
% subtract from the mean - Thus achieving centering
B = X - ones(52,1)*u;


% Plot the data after centering it around the Origin
figure(2);
scatter(B(1:13,1), B(1:13,2), 'g'); 
hold on;
scatter(B(14:26,1), B(14:26,2), 'b');
hold on;
scatter(B(27:39,1), B(27:39,2), 'r');
hold on;
scatter(B(40:52,1), B(40:52,2), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

dataset after centering

  • Next I will find the empirical covariance matrix from the outer product of matrix B with itself.
C = (B'*B)/(n-1);
  • Then we can find the eigenvectors of the covariance matrix and choose the one which has the larger eigenvalue (for this dataset that is the second eigenvector returned by eig). Now we can project all the 2D points onto the line in the direction of that particular eigenvector.
% we get a column vector of eigen values - the first will be the one with
% the largest magnitude
E = svds(C);
% V will have the eigen vectors
[V,D] = eig(C);

% Thus the axis along which variance is Maximum is
maxVarianceAxis = V(:,2);

% No need to divide by (maxVarianceAxis'*maxVarianceAxis) as
% eig generates orthonormal eigenvectors
projectionMatrix = (maxVarianceAxis*maxVarianceAxis');

projCluster1 = projectionMatrix*cluster1;
projCluster2 = projectionMatrix*cluster2;
projCluster3 = projectionMatrix*cluster3;
projCluster4 = projectionMatrix*cluster4;

% Plot the data after projecting onto the principle component
figure(3);
scatter(projCluster1(1,:), projCluster1(2,:), 'g'); 
hold on;
scatter(projCluster2(1,:), projCluster2(2,:), 'b');
hold on;
scatter(projCluster3(1,:), projCluster3(2,:), 'r');
hold on;
scatter(projCluster4(1,:), projCluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

After projecting onto the principal component axis

As we can see the separation of blue and green clusters is good, but the light blue cluster is getting mixed up with the red cluster.

Before I wrap up, I’d like to show what the projection of these data vectors onto the other eigenvector looks like,

Projection onto the other eigenvector

See how the red and light blue clusters are well separated here (though the green and the dark blue are not as well separated as in the projection onto the eigenvector corresponding to the largest eigenvalue). The variance captured by that projection is only slightly larger than this one, because the eigenvalues are close - about 0.2 apart.

So here we had 2 features, which we reduced to a single feature (the position of the data vector on the line - a single value if expressed as the distance from the origin). Thus we successfully implemented PCA and extracted an abstract feature along which the data has maximum variance (globally), thereby reducing dimensionality.

This leads us to LDA, where we no longer look at this same dataset as just data vectors; rather, we take into consideration that they are vectors from different classes.

Linear Discriminant Analysis

This is a method to find a linear combination of features that characterizes or separates two or more classes of objects.

This method is also known as Fisher’s linear discriminant. The idea is that we don’t wish to project such that the overall variance is maximized; rather, we want the clusters to be spread out (maximize the between-class covariance after projection) while the data vectors of a particular class stay close to the mean of that class (the within-class covariance over all the classes should be minimized).

The idea is that this enables us to easily discriminate new data vectors into classes. The more clearly separated the classes are, the less ambiguity we face when classifying.

I’ll explain using 2 classes and then extend it to multiple classes (in this dataset we are using 4 classes). Let there be 2 classes, $C_1$ and $C_2$, having $n_1$ and $n_2$ vectors respectively. So the means of the classes will be

$$\mathbf{m}_k = \frac{1}{n_k}\sum_{\mathbf{x} \in C_k} \mathbf{x}, \qquad k = 1, 2.$$

The projection from $p$ dimensions to 1 dimension is done using

$$y = \mathbf{w}^{T}\mathbf{x},$$

where $\mathbf{w}$ and $\mathbf{x}$ are both $p \times 1$ vectors and $y$ is a scalar.

The simplest measure of the separation of the classes, when projected along the line defined by $\mathbf{w}$, is the square of the separation between the projected class means. If the means after projection of classes 1 and 2 are $\tilde{m}_1 = \mathbf{w}^T\mathbf{m}_1$ and $\tilde{m}_2 = \mathbf{w}^T\mathbf{m}_2$, then

$$(\tilde{m}_2 - \tilde{m}_1)^2 = \mathbf{w}^T S_B\, \mathbf{w}.$$

Here we define the between-class covariance matrix as

$$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T.$$

Thus $\mathbf{w}^T S_B\, \mathbf{w}$ is to be maximized.

While doing so, we also minimize the sum of the within-class variances of the two classes, $\tilde{s}_1^2 + \tilde{s}_2^2$.

The within-class variance of the transformed data for the $k^{th}$ class is given by

$$\tilde{s}_k^2 = \sum_{\mathbf{x} \in C_k} \left(\mathbf{w}^T\mathbf{x} - \tilde{m}_k\right)^2.$$

So, after taking $\mathbf{w}$ out of the sigma and rewriting the summation in terms of the total within-class covariance matrix

$$S_W = \sum_{\mathbf{x} \in C_1}(\mathbf{x} - \mathbf{m}_1)(\mathbf{x} - \mathbf{m}_1)^T + \sum_{\mathbf{x} \in C_2}(\mathbf{x} - \mathbf{m}_2)(\mathbf{x} - \mathbf{m}_2)^T,$$

for 2 classes the measure of how close the data vectors are to their respective class means is

$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^T S_W\, \mathbf{w}.$$

Thus $\mathbf{w}^T S_W\, \mathbf{w}$ is to be minimized.

So with this we can construct the utility function which we need to maximize,

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B\, \mathbf{w}}{\mathbf{w}^T S_W\, \mathbf{w}}.$$

Differentiating $J(\mathbf{w})$ with respect to $\mathbf{w}$, setting it equal to zero, and rearranging gives

$$\mathbf{w} \propto S_W^{-1}\,(\mathbf{m}_2 - \mathbf{m}_1).$$
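As a concrete illustration of this two-class result, here is a small Python/numpy sketch of mine on synthetic data (not part of the original assignment):

import numpy as np

rng = np.random.default_rng(1)
# Two synthetic 2-D classes (stand-ins for the real clusters).
c1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(13, 2))
c2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(13, 2))

m1, m2 = c1.mean(axis=0), c2.mean(axis=0)
# Total within-class scatter matrix S_W
Sw = (c1 - m1).T @ (c1 - m1) + (c2 - m2).T @ (c2 - m2)
# Fisher direction: w is proportional to inv(S_W) (m2 - m1)
w = np.linalg.solve(Sw, m2 - m1)
w /= np.linalg.norm(w)

# Projections of the two classes onto w should barely overlap.
print(np.sort(c1 @ w))
print(np.sort(c2 @ w))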

Multi Class LDA

Thus we have found the direction $\mathbf{w}$ for which our conditions are optimally achieved.

But extending this idea to multiple classes requires us to make some assumptions, so as to make the mathematics a little easier.

Suppose that now there are $c$ classes. The within-class covariance matrix which is calculated is

$$S_W = \sum_{k=1}^{c}\ \sum_{\mathbf{x} \in C_k}(\mathbf{x} - \mathbf{m}_k)(\mathbf{x} - \mathbf{m}_k)^T,$$

while the between-class covariance matrix is given by

$$S_B = \sum_{k=1}^{c} n_k\,(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T.$$

In the above 2 equations $\mathbf{m}$ is the mean of all the data vectors and $\mathbf{m}_k$ is the mean of the $n_k$ data vectors in the $k^{th}$ class.

Using the revised covariance matrices and plugging them into the utility function, we will find that the optimal solution for $\mathbf{w}$ is:

the optimal $\mathbf{w}$ is the leading eigenvector of $S_W^{-1}\,S_B$.

Kernel Trick + LDA = K-LDA

K-LDA, the kernelized version of linear discriminant analysis (LDA) (also known as kernel Fisher discriminant analysis or kernel discriminant analysis), is LDA implicitly performed in a new feature space, which allows non-linear mappings to be learned.

For most real-world data a linear discriminant (offered by LDA) is not complex enough, because the data vectors might not be linearly separable at all.

We first map the data non-linearly into some feature space (a higher-dimensional space) and compute LDA there (for different kernel functions these mappings differ). The linear separation which LDA provides in the feature space will yield a non-linear discriminant in the original input space. You can think of 2 clusters (in the 2D plane) with one of the clusters embedded inside the other; now, if we map these vectors to 3 dimensions, we can easily introduce a dummy variable (the third axis) such that the 2 clusters are easily separated linearly (by a plane) - see the small sketch after the image below. As you can see below, the left side image is the input space and the right side image is the feature space.

why use feature space
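Here is a tiny Python sketch of mine of that embedded-clusters picture (not from the post): two concentric rings are not linearly separable in 2-D, but adding x² + y² as a third feature separates them with a plane.

import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]   # cluster embedded inside...
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]   # ...the other cluster

def lift(X):
    """Map (x, y) -> (x, y, x^2 + y^2); the extra 'dummy' feature makes the rings separable."""
    return np.c_[X, (X ** 2).sum(axis=1)]

# In the lifted space the third coordinate alone separates the two clusters:
print(lift(inner)[:, 2].max(), lift(outer)[:, 2].min())   # 0.25 < 4.0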

So, as we are operating in the feature space, the only change to the LDA explained above is that we use $\phi(\mathbf{x})$ instead of $\mathbf{x}$ everywhere. Explicitly computing the mappings $\phi(\mathbf{x})$ and then performing LDA can be computationally expensive, and in many cases intractable. For example, $\phi(\mathbf{x})$ may be infinite-dimensional (as in the case of the Gaussian kernel).

So instead of designing a nonlinear transformation $\phi$ and calculating the transformed data vectors $\phi(\mathbf{x})$ in the feature space, we can rewrite the method in terms of dot products only and use the kernel trick, in which the dot product in the new feature space is replaced by a kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$.

I used this paper and also referred to Wikipedia - we can rewrite the equations for $\mathbf{w}$ in terms of only dot products.

In the following I am assuming these sizes apply (the rest can be understood implicitly): there are $l$ data vectors in total, each with $p$ features, and the projection is onto 1 dimension. The data vectors are separated into $c$ classes, with the $j^{th}$ class having $l_j$ data vectors.

First of all, $\mathbf{w}$ should lie in the span of all the (mapped) training samples (read up on the theory of reproducing kernels to understand why):

$$\mathbf{w} = \sum_{i=1}^{l} \alpha_i\, \phi(\mathbf{x}_i),$$

where $l$ is the total number of vectors.

Now, the dot product of $\mathbf{w}$ with the mean of the $j^{th}$ class in the feature space (the class means in the feature space are defined just like how we had discussed in the LDA section) will give

$$\mathbf{w}^T \mathbf{m}_j^{\phi} \;=\; \frac{1}{l_j}\sum_{i=1}^{l}\sum_{k=1}^{l_j} \alpha_i\, k(\mathbf{x}_i, \mathbf{x}_k^{j}).$$

Using the equation for $\mathbf{w}$ and the kernel function we can therefore say

$$\mathbf{w}^T \mathbf{m}_j^{\phi} = \boldsymbol{\alpha}^T M_j,$$

where $\boldsymbol{\alpha}$ is the vector of all the $\alpha_i$ piled together, and $M_j$ is a vector whose $i^{th}$ element is defined as

$$(M_j)_i = \frac{1}{l_j}\sum_{k=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_k^{j}).$$

Here $\mathbf{x}_i$ is the $i^{th}$ data vector among ALL the vectors, while $\mathbf{x}_k^{j}$ is the $k^{th}$ vector in the $j^{th}$ class.

If we define a matrix M as

$$M = \sum_{j=1}^{c} l_j\,(M_j - M_*)(M_j - M_*)^T,$$

where the $i^{th}$ element of the vector $M_*$ is defined as

$$(M_*)_i = \frac{1}{l}\sum_{k=1}^{l} k(\mathbf{x}_i, \mathbf{x}_k),$$

then the numerator of the utility function can be written as

$$\boldsymbol{\alpha}^T M\, \boldsymbol{\alpha}.$$

Here M plays the role of the between-class covariance matrix defined for the transformed data vectors (in the feature space).

Moving on to the denominator: if we define a matrix N as

$$N = \sum_{j=1}^{c} K_j\,(I - \mathbf{1}_{l_j})\,K_j^T,$$

with the $(i, k)$ component of $K_j$ defined as $k(\mathbf{x}_i, \mathbf{x}_k^{j})$, and $\mathbf{1}_{l_j}$ being the $l_j \times l_j$ matrix with all entries equal to $1/l_j$.

Here $K_j$ (an $l \times l_j$ matrix) is known as the kernel matrix for class $j$.

Then the denominator of the utility function can be written as

$$\boldsymbol{\alpha}^T N\, \boldsymbol{\alpha}.$$

Thus, using both of the highlighted equations, the utility function in the feature space is

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T M\, \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T N\, \boldsymbol{\alpha}}.$$

This problem can be solved (just like how LDA was solved in the input space) and the optimal value for $\boldsymbol{\alpha}$ is the leading eigenvector of $N^{-1}M$.

MATLAB Implementation

The above equations can be implemented in MATLAB as follows. Note there are 2 features and we wish to reduce them to a single dimension. There are 4 classes ($c = 4$), each class having 13 vectors ($l_j = 13$), and so the total number of vectors is $l = 52$.

load TRAINTEST2D

cluster1 = TRAIN{1,6}{1,1}; % Green
cluster2 = TRAIN{1,6}{1,2}; % Blue
cluster3 = TRAIN{1,6}{1,3}; % Red
cluster4 = TRAIN{1,6}{1,4}; % Cyan

nT = 52; % Total number of data Vectors
n1 = 13; % # vectors in cluster 1
n2 = 13; % # vectors in cluster 2
n3 = 13; % # vectors in cluster 3
n4 = 13; % # vectors in cluster 4

% Plot the data before dimensionality reduction
figure(1);
scatter(cluster1(1,:), cluster1(2,:), 'g'); 
hold on;
scatter(cluster2(1,:), cluster2(2,:), 'b');
hold on;
scatter(cluster3(1,:), cluster3(2,:), 'r');
hold on;
scatter(cluster4(1,:), cluster4(2,:), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');
  • Now I’ll construct the Gram Matrix for all the data vectors,
% CONSTRUCT KERNEL MATRIX
gamma = 10;

% Arrange all the clusters together
X = [cluster1';cluster2'; cluster3'; cluster4'];
K = ones(nT,nT);

% Construct the Gram Matrix 
% K(i,:) - gives the ith vector out of ALL vectors
for i = 1:1:nT
    for j = 1:1:nT
        K(i,j) = kernelGauss(X(i,:),X(j,:),gamma);
    end
end
  • Following the equations given above we can construct M,
% Okay K(i,j) = K(x_i,x_j) Now as X(1:13,:), X(14:26,:), X(27:39,:), 
% X(40:52,:) are the 4 clusters respectively. We can calculate M_i as
% the numbers are like that due to how X was constructed.

M_1 = ones(nT,1);
M_2 = ones(nT,1);
M_3 = ones(nT,1);
M_4 = ones(nT,1);
M_star = ones(nT,1);

% sum(K(1:13,j)) is the sum of k(x,x_j) over all x belonging to cluster 1
% sum(K(14:26,j)) is the sum of k(x,x_j) over all x belonging to cluster 2
% sum(K(27:39,j)) is the sum of k(x,x_j) over all x belonging to cluster 3
% sum(K(40:52,j)) is the sum of k(x,x_j) over all x belonging to cluster 4
% sum(K(:,j)) is the sum of k(x,x_j) over ALL x (all clusters)

for j = 1:1:nT
    M_1(j) = (1/nT)*sum(K(1:13,j));
end
for j = 1:1:nT
    M_2(j) = (1/nT)*sum(K(14:26,j));
end
for j = 1:1:nT
    M_3(j) = (1/nT)*sum(K(27:39,j));
end
for j = 1:1:nT
    M_4(j) = (1/nT)*sum(K(40:52,j));
end

for j = 1:1:nT
    M_star(j) = (1/nT)*sum(K(:,j));
end

% Thus we can construct M
M = n1*(M_1-M_star)*(M_1-M_star)' + n2*(M_2-M_star)*(M_2-M_star)' + n3*(M_3-M_star)*(M_3-M_star)' + n4*(M_4-M_star)*(M_4-M_star)';
  • For constructing N, we will first separate the gram matrix we had made into the kernel matrices for each of the classes.
% Now we shall construct the kernel matrices for each of the clusters
K_1 = K(:,1:13);
K_2 = K(:,14:26);
K_3 = K(:,27:39);
K_4 = K(:,40:52);

% Thus we can construct N
N = K_1*(eye(n1,n1)-(1/n1)*ones(n1,n1))*K_1' + K_2*(eye(n2,n2)-(1/n2)*ones(n2,n2))*K_2' + K_3*(eye(n3,n3)-(1/n3)*ones(n3,n3))*K_3' + K_4*(eye(n4,n4)-(1/n4)*ones(n4,n4))*K_4';
  • Now compute $N^{-1}M$ and its eigenvalues and eigenvectors. Note that, as N is usually singular, we add a multiple of the identity matrix to it so that inv(N) becomes computable.
% Note that in practice, \mathbf{N} is usually singular and 
% so a multiple of the identity is added to it
N = N + 2*eye(nT,nT);
cond(N) % is N ill conditioned? 

% Now that we have both N and M, we can find all the eigenvectors
% and choose most prominent eigenvectors
P = inv(N)*M;
[V,D] = eig(P);
  • Now we can project the data onto a line using the leading eigenvector, which has an eigenvalue of 3.07.
% the first eigenvector has the largest corresponding eigenvalue - 3.07
% after which 0.4, 0.28, 0.26 Then it drops down...
alpha = V(:,1);

% the projected points
y = ones(nT,1);

for i = 1:1:nT
    y(i) = alpha'*K(:,i);
end

projCluster1 = y(1:13);
projCluster2 = y(14:26);
projCluster3 = y(27:39);
projCluster4 = y(40:52);

% Plot the projected data
figure(3);
scatter(projCluster1, zeros(13,1), 'g'); 
hold on;
scatter(projCluster2, zeros(13,1), 'b');
hold on;
scatter(projCluster3, zeros(13,1), 'r');
hold on;
scatter(projCluster4, zeros(13,1), 'c');
hold on;
legend('cluster 1','cluster 2','cluster 3','cluster 4');

The plot we get for gamma (of the kernel function) being 10 is,

After projection

If we try different values of gamma for the kernel function, we can surely get even better separation of the classes.

So in conclusion, both methods can be viewed as techniques for linear dimensionality reduction. However, PCA is unsupervised and depends only on the data vectors (maximizing their variance using fewer features), whereas the Fisher linear discriminant (K-LDA) also uses class-label information to bring data vectors closer to other data vectors in the same class while maximizing the variance between classes.

Written with StackEdit.