Thursday, April 14, 2016

Equivalent Kernel

The prediction of the output for an input $x$ is

$$y(x) = \sum_{n=1}^{N} k(x, x_n)\, t_n,$$

here the function $k(x, x_n)$ is called the smoother matrix or the equivalent kernel.

Note that we are substituting the solution for $\mathbf{w}$, namely the MAP estimate $\mathbf{m}_N$ obtained when we maximize the posterior p.d.f.

This effective kernel defines the weights by which the training
set target values are linearly combined in order to make a prediction for a new value of x.
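
For reference, in the standard Bayesian linear regression setting this refers to (maximizing the posterior gives the MAP weights $\mathbf{m}_N$), the relation can be written out explicitly; the notation here ($\phi$, $S_N$, $\beta$) is the usual textbook one, not necessarily the exact symbols used elsewhere in this post:

$$y(x) = \phi(x)^\top \mathbf{m}_N = \sum_{n=1}^{N} k(x, x_n)\, t_n, \qquad k(x, x') = \beta\, \phi(x)^\top S_N\, \phi(x').$$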

Here the scalar $k(x, x_n)$ can be thought of as a measure of how important the $n$-th input-output pair $(x_n, t_n)$ (the $n$-th mapping in the training dataset) is for the prediction of the output corresponding to input $x$.

For example, with the Gaussian kernel, the closer $x_n$ is to $x$, the larger the magnitude of the kernel function; thus the corresponding $t_n$ will be weighted higher.

This is shown visually below, when we work with an actual dataset.

You can read more about this in the wiki page for Kernel Smoother.
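
To make the idea concrete before we get to the real dataset, here is a minimal MATLAB sketch of such a kernel-smoothed prediction for a single query point (the toy data, the Gaussian kernel form and the local names are assumptions for illustration only):

% Toy 1-D training data (made up for illustration)
xTrain = [0.1; 0.3; 0.5; 0.7; 0.9];   % training inputs x_n
tTrain = [1.0; 2.0; 1.5; 0.5; 0.0];   % training targets t_n
gamma  = 50;                          % kernel width parameter
xQuery = 0.42;                        % new input x

% Gaussian kernel weights k(x, x_n) = exp(-gamma * ||x - x_n||^2)
w = exp(-gamma * (xQuery - xTrain).^2);

% Prediction: normalized, kernel-weighted combination of the training targets
yPred = (w' * tTrain) / sum(w);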

Kernel Matrix

A kernel matrix $K$ operates on the training set and produces predictions; $M$ is the number of input vectors $x_o$ we want predictions for, and there are $N$ data samples $\{(x_i, t_i)\}$ in the training dataset, so $K$ is $M \times N$.

Here the predictions for the $M$ input vectors will be a vector, $\hat{Y} = K\,T$, where $T$ collects the training targets.

If we wish to form a "model" of sorts (one that gives us predictions) instead of laboriously computing the kernels every time we wish to make a prediction, we can form the kernel matrix $K$ once; then for any new input we can find the closest grid input $x_o$. In order to save memory and still ensure accuracy, we can set a resolution and operate on normalized data.

So if we consider a resolution of 0.01 and the normalization range to be [0, 1], then there will be $101 \times 101 = 10201$ input vectors $x_o$ if we consider the input to be 2-dimensional.

So even though it takes a long time to compute the entire matrix (and to experiment with various kernel parameter values to finally obtain it), once we have it we can quickly predict for new inputs.
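
As a rough sketch of what this lookup looks like (the grid ordering matches the loops used further below; the example input is made up):

r = 0.01;                       % resolution
nGrid = round(1/r) + 1;         % grid values per dimension (101 for r = 0.01)
M = nGrid * nGrid;              % 10201 grid inputs X_o for a 2-D input

x = [0.3768, 0.9214];           % a new normalized input (made up)
xSnap = round(x / r) * r;       % snap each component to the nearest grid value

% Row index of the snapped input in the M*2 grid built further below
% (first component varies slowest, second component fastest)
i = round(xSnap(1) / r) + 1;
j = round(xSnap(2) / r) + 1;
index = j + nGrid * (i - 1);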

Now I’ll use the above concepts to form such a model using this dataset. Find the whole code here; please ensure you have the kernel functions defined in separate files and added to MATLAB’s path.

% The exponential kernel is closely related to the Gaussian kernel, with only the 
% square of the norm left out. It is also a radial basis function kernel.
function [res] = kernelExp(X1, X2, c)
    res = exp( -(norm(X1-X2)) * c );
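
The code below also calls kernelGauss and kernelHyperTangent, which are not listed here; plausible definitions consistent with how they are called (my assumption — save each in its own file, kernelGauss.m and kernelHyperTangent.m) would be:

% Gaussian (RBF) kernel: exp(-gamma * ||X1 - X2||^2)   (assumed form)
function [res] = kernelGauss(X1, X2, gamma)
    res = exp( -gamma * (norm(X1-X2)^2) );
end

% Hyperbolic-tangent (sigmoid) kernel: tanh(alpha * <X1, X2> + c)   (assumed form)
function [res] = kernelHyperTangent(X1, X2, alpha, c)
    res = tanh( alpha * (X1 * X2') + c );
end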
  • First load the data into the workspace,

load('ASSIGNMENT1.mat');

  • Separate DATA into the training dataset and testing dataset.
% Training DataSet - 7,500 data samples
inputTrain = DATA(1:7500,1:2); % each row is a 1*2 input vector
outputTrain = DATA(1:7500,3:4); % each row is a 1*2 target vector

% Testing DataSet
inputTest = DATA(7501:10000,1:2);
outputTest = DATA(7501:10000,3:4);
  • Now, depending on how the input is normalized (in our case it’s 0 to 1), initialize the parameters. The finer the resolution, the longer building the model will take, but the model will then serve a larger number of inputs with greater accuracy (we won’t have to round off as far to the closest grid point). I am preallocating the modelInputs matrix (growing an array by assignment or concatenation hurts performance).
r = 0.1; % Resolution
% The number of X_o's (as input is 1*2 and normalized to [0,1]);
% int32 so that a finer resolution (e.g. r = 0.01 -> 10201 points) does not overflow
M = int32( ((1/r)+1)*((1/r)+1) ); 
N = 7500; % 75% of the total 10,000 samples are used as the training set.
modelInputs = zeros(M,2);
NormalisedInputs = 0:r:1;
  • Construct the set of input vectors using NormalisedInputs, by taking all pairwise combinations of its values (as we need 1*2 input vectors).
for i = 1:1:length(NormalisedInputs)
    for j = 1:1:length(NormalisedInputs)
        %modelInputs = [modelInputs; [NormalisedInputs(i),NormalisedInputs(j)]];
        modelInputs(j+length(NormalisedInputs)*(i-1),1:2) = [NormalisedInputs(i),NormalisedInputs(j)];
    end
end
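
As a side note, the same grid can be built without the double loop; meshgrid produces the identical M*2 matrix with the same row ordering:

[A, B] = meshgrid(NormalisedInputs, NormalisedInputs);
% A(:) cycles the first component slowly, B(:) cycles the second quickly,
% which matches the loop ordering above
modelInputsAlt = [A(:), B(:)];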
  • For a particular value of gamma we can now construct the matrix $K$,
K = zeros(M,N);
gamma = 0.1;

for i = 1:1:M
    for j = 1:1:N
        K(i,j) = kernelGauss(modelInputs(i,1:2), inputTrain(j,1:2), gamma);
    end
end

I built this matrix for two kernels, the Gaussian kernel and the hyperbolic-tangent (sigmoid) kernel. They are the following,

K_gauss = zeros(M,N);
K_sigmoid = zeros(M,N);
gamma = 10;
alpha = 1;
c = 1;

for i = 1:1:M
    for j = 1:1:N
        K_gauss(i,j) = kernelGauss(modelInputs(i,1:2), inputTrain(j,1:2), gamma);
        K_sigmoid(i,j) = kernelHyperTangent(modelInputs(i,1:2), inputTrain(j,1:2), alpha, c);
    end
end
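
As an aside, the Gaussian kernel matrix can also be computed without the double loop, which is much faster at finer resolutions (a sketch, assuming kernelGauss computes exp(-gamma * ||X1 - X2||^2)):

% Squared Euclidean distances between every model input and every training input
D2 = bsxfun(@plus, sum(modelInputs.^2, 2), sum(inputTrain.^2, 2)') ...
     - 2 * (modelInputs * inputTrain');
K_gauss_vec = exp(-gamma * max(D2, 0));  % max() guards against tiny negative round-off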
  • At this point I would like to show visually how the weighting of the training set works. As you can see, each row of the matrix corresponds to a single model input, while the columns correspond to the training samples. So if you run the following code,

Note that I tried for about two days to find the MATLAB syntax to plot the first input parameter on the X axis and the second input parameter on the Y axis. So if we take, say, a resolution of 0.1, that would be the scale of the X and Y axes. The Z axis then represents the weight we attribute to that input: a scatter plot of (x, y, z) points, where (x, y) is the input and z is its corresponding weight. Seen this way, it would be clear, like a hill, how the inputs similar to the training input under consideration get high weightage.
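
For what it’s worth, here is a sketch of how that 3-D view could be produced with scatter3 (not from the original code; P = 98 is the sample examined below):

P = 98;  % training sample under consideration
scatter3(modelInputs(1:M,1), modelInputs(1:M,2), K_gauss(1:M,P), 36, K_gauss(1:M,P), 'filled');
xlabel('input dimension 1'); ylabel('input dimension 2'); zlabel('weight');
title('Weights given to the 121 model inputs for training sample P');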

The following code displays the weights corresponding to all 121 input vectors from (0,0) to (1,1) (when we took resolution = 0.1) for the Pth training sample. Here M is 121 because that many input vectors exist from 0 to 1 with a resolution of 0.1.

P = 98; % index of the training sample under consideration
plot(K_gauss(1:M,P));

When P = 98 (the index of the training sample is 98) with training input - (0.396, 0.495), we get the following plot.

[Figure: weight distribution over the 121 model inputs for training sample P = 98]

See how we have 5 significant peaks in the weights? If we had plotted it in 3-D we could have seen that all these peaks are actually close together.
The largest peak is at index 50 which happens to be the input vector (0.4,0.5). Note how similar it is to the training input we are comparing with.
The next 2 peaks are at 39 and 61, which are (0.3,0.5) and (0.5,0.5). So we see a unit resolution change also counts as close.
The final 2 are 28 and 72 which are (0.2,0.5) and (0.6,0.5).

So this shows us how $k(x_o, x_i)$ is larger the closer $x_o$ is to $x_i$.

Now the question arises: how much larger should the weights be when the inputs are close? Or rather, how should the weights vary as the distance $\lVert x_o - x_i \rVert$ between them changes? This is controlled using the parameter $\gamma$; I used a value of 50 and obtained the above graph. Note that a larger gamma means more emphasis on closeness. So for a much larger $\gamma$, you will get a single peak at the 50th index.

[Figure: with a much larger gamma, the weights collapse to a single impulse-like peak at index 50]
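
To put rough numbers on this, here is a quick illustration (not from the original post) of how the Gaussian weight at a fixed distance shrinks as gamma grows:

d = 0.1;                        % distance between x_o and x_i
gammas = [10, 50, 1000];
weights = exp(-gammas * d^2)    % ~0.905, ~0.607 and ~4.5e-5 respectively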

This much should have given you a good idea of how the Gaussian kernel works as an instance-based learner.

OK, so for a particular training sample (i.e. $P = 98$) we plotted the weight values (one for each of the 121 input vectors) and showed how the input vectors that are “closer” to that particular training sample’s input are given larger weights.

Cool, so to predict the output for a particular input vector $x_o$, we use all the weights and weigh ALL the training samples’ outputs.

But first we need to normalize the weights to sum to 1, as these weights only capture the relative importance of the various training samples’ outputs.

% This vector's ith element is the sum of weights used to predict the 
% output for the ith input vector.
sumOfWeights = zeros([M,1]);
for i = 1:1:M
    sumOfWeights(i) = sum(K_gauss(i,1:N));
end

predictions = K_gauss * outputTrain;
% Normalize it
predictions(1:M,1) = predictions(1:M,1)./ sumOfWeights;
predictions(1:M,2) = predictions(1:M,2)./ sumOfWeights;
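
(The loop and the per-column division above can also be collapsed into a single expression, if you prefer:)

% Equivalent one-liner: divide each row of the raw predictions by its weight sum
predictionsAlt = bsxfun(@rdivide, K_gauss * outputTrain, sum(K_gauss, 2));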

Moving on to testing this model, we can calculate the MSE when we compare the results of the prediction to the actual outputs.

% Now we are going to test this model - calculate the MSE
L = 10000 - N; % remaining samples in the data set are used to test
numberOfDecimals = 1; % matches the resolution r = 0.1 (round to the nearest grid value)
mse = 0;

for i = 1:1:L
    % in this iteration we wish to find the output for inputTest(i,1:2)
    modelInput = inputTest(i,1:2);

    % First we need to find which index this input is closest to
    % in modelInputs
    X1 = round(modelInput(1), numberOfDecimals);
    X2 = round(modelInput(2), numberOfDecimals);

    index1 = find(modelInputs(1:M,1) == X1);
    index2 = find(modelInputs(1:M,2) == X2);
    index = intersect(index1,index2);

    % so the corresponding prediction will be
    modelOutput = predictions(index,1:2);
    actualOutput = outputTest(i,1:2);

    squaredError = (modelOutput - actualOutput)*(modelOutput - actualOutput)';
    mse = mse + squaredError;
end

% since it is the mean, divide by the number of test samples
mse = mse/L;

Using the Gaussian kernel we get a mean squared error that is reasonably small.

