Preparing data for SVM: is it valid to normalise the data before and after PCA dimension reduction?

Is it valid to normalise a dataset, reduce its dimensionality with PCA, and then normalise the reduced-dimension data? Assuming this is performed on training data, should the same PCA coefficients be used to reduce the dimension of the test data? Should the same max and min normalisation values be used for the test and training data? I have included a simplified example of the code I am using, which may describe what I mean better. Thanks in advance.

%% Prepare Training Data


% Normalise training data
mindata=min(TRAINDATA); maxdata=max(TRAINDATA);
TRAINDATA = ((TRAINDATA-repmat(mindata,[size(TRAINDATA,1),1]))./(repmat(maxdata,[size(TRAINDATA,1),1])-repmat(mindata,[size(TRAINDATA,1),1])) - 0.5 ) *2;

% Perform PCA
mTRAINDATA = mean(mean(TRAINDATA));
TRAINDATA = TRAINDATA - mTRAINDATA;
Cpca = princomp(TRAINDATA,'econ'); % only the coefficients are needed; princomp returns at most four outputs
EigenRange = 1:2;
Cpca = Cpca(:,EigenRange);
TRAINDATA = TRAINDATA*Cpca;
TRAINDATA = TRAINDATA + mTRAINDATA;

% Normalise training data second time
mindata2=min(TRAINDATA); maxdata2=max(TRAINDATA);
TRAINDATA = ((TRAINDATA-repmat(mindata2,[size(TRAINDATA,1),1]))./(repmat(maxdata2,[size(TRAINDATA,1),1])-repmat(mindata2,[size(TRAINDATA,1),1])) - 0.5 ) *2;



%% Prepare Test Data

% Normalise using first normalisation values from training data
TESTDATA = ((TESTDATA-repmat(mindata,[size(TESTDATA,1),1]))./(repmat(maxdata,[size(TESTDATA,1),1])-repmat(mindata,[size(TESTDATA,1),1])) - 0.5 ) *2;

% Perform PCA
mTESTDATA = mean(mean(TESTDATA));
TESTDATA = TESTDATA - mTESTDATA;
TESTDATA = TESTDATA*Cpca;
TESTDATA = TESTDATA + mTRAINDATA;

% Normalise using second normalisation values from training data
TESTDATA = ((TESTDATA-repmat(mindata2,[size(TESTDATA,1),1]))./(repmat(maxdata2,[size(TESTDATA,1),1])-repmat(mindata2,[size(TESTDATA,1),1])) - 0.5 ) *2;

Topic svm dimensionality-reduction libsvm machine-learning

Category Data Science


As far as the PCA components are concerned, you should apply the same components (the same coefficients, and the same number of them) to the test data. The logic is that the test data must undergo exactly the same transformation as the training data. I am assuming here that your train and test data are independently drawn.

It is, in fact, important to normalize your data before applying PCA. PCA ranks directions by how much variance they capture, so if you do not normalize, variables with numerically large values receive a disproportionately high weight. Normalizing again before applying the SVM on top of the reduced data is fine.
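To make the point about numerically large variables concrete, here is a minimal sketch on made-up data. The synthetic matrix `X`, the variable names, and the use of `eig(cov(...))` in place of `princomp` (so that it runs in base MATLAB) are all assumptions for illustration, not part of the original question.

```matlab
% Synthetic data (made up for illustration): two independent features
% on very different numeric scales.
rng(0);                                  % reproducible random numbers
X = [randn(200,1), 1000*randn(200,1)];   % column 2 has a much larger range

% PCA via eigen-decomposition of the covariance matrix (base MATLAB only).
[V, D] = eig(cov(X));
[~, order] = sort(diag(D), 'descend');
pc1_raw = V(:, order(1));                % first principal direction, raw data

% Min-max scale each column to [-1, 1], then repeat the decomposition.
mn = min(X, [], 1);  mx = max(X, [], 1);
Xs = 2 * bsxfun(@rdivide, bsxfun(@minus, X, mn), mx - mn) - 1;
[Vs, Ds] = eig(cov(Xs));
[~, order_s] = sort(diag(Ds), 'descend');
pc1_scaled = Vs(:, order_s(1));

disp(abs(pc1_raw'));     % loading sits almost entirely on column 2
disp(abs(pc1_scaled'));  % direction is no longer dictated by the raw units
```

On the raw data the first component is essentially the large-range column by itself; after scaling, both features contribute on comparable terms.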

The basic rule is that the scale should remain the same throughout, so after normalization you should use the same scale, that is, the same max and min from the training data, for the test data.
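The advice in this answer could be sketched roughly as follows: every parameter (min, max, mean, PCA coefficients) is estimated from the training set and then reused unchanged on the test set. The random placeholder data, the helper `scale`, and the names `mn1`, `mx1`, `mu`, `mn2`, `mx2` are assumptions for illustration; `eig(cov(...))` stands in for `princomp` so the sketch runs without a toolbox.

```matlab
% Placeholder data, assumed for illustration; rows are samples.
rng(1);
TRAINDATA = rand(100, 5);
TESTDATA  = rand(20, 5);

% Min-max scaling to [-1, 1] given per-column min and max.
scale = @(X, mn, mx) 2 * bsxfun(@rdivide, bsxfun(@minus, X, mn), mx - mn) - 1;

% --- Fit everything on the training set only ---
mn1 = min(TRAINDATA, [], 1);  mx1 = max(TRAINDATA, [], 1);
Xtr = scale(TRAINDATA, mn1, mx1);

mu = mean(Xtr, 1);                        % per-column training mean
[V, D] = eig(cov(Xtr));                   % PCA via covariance eigenvectors
[~, order] = sort(diag(D), 'descend');
Cpca = V(:, order(1:2));                  % keep the first two components

Ztr = bsxfun(@minus, Xtr, mu) * Cpca;     % training scores
mn2 = min(Ztr, [], 1);  mx2 = max(Ztr, [], 1);
Ztr = scale(Ztr, mn2, mx2);

% --- Apply the stored training parameters to the test set ---
Xte = scale(TESTDATA, mn1, mx1);
Zte = bsxfun(@minus, Xte, mu) * Cpca;     % same mean, same coefficients
Zte = scale(Zte, mn2, mx2);               % may fall slightly outside [-1, 1]
```

Note that the test set is centred with the training column means rather than its own mean, so that both sets pass through an identical transformation. Test values can land a little outside [-1, 1], which is expected when the training min and max are reused.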


Depending on the language of your choice, you may or may not have to normalise the data yourself. For example, the e1071 package in R does this automatically for you. As it is built on the 'libsvm' library, it might be that this is also the case there; the library documentation is your best source. As for the normalisation values, you should definitely use the same [min, max] values from the training set for the test set as well. Also, I would recommend reducing the dimensionality of your data first, then normalising before running an SVM. Hope that helps.
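The ordering suggested here (reduce first, then normalise with the training [min, max]) might look like the sketch below. The placeholder data and the names `mu`, `Cpca`, `mn`, `mx`, and `scale` are assumptions for illustration, and `eig(cov(...))` is used instead of a toolbox PCA routine.

```matlab
% Placeholder data, assumed for illustration; rows are samples.
rng(2);
TRAINDATA = rand(100, 5);
TESTDATA  = rand(20, 5);

% 1) PCA fitted on the (unscaled) training data.
mu = mean(TRAINDATA, 1);                  % per-column training mean
[V, D] = eig(cov(TRAINDATA));
[~, order] = sort(diag(D), 'descend');
Cpca = V(:, order(1:2));                  % keep the first two components

Ztr = bsxfun(@minus, TRAINDATA, mu) * Cpca;
Zte = bsxfun(@minus, TESTDATA,  mu) * Cpca;

% 2) Min-max scale the scores to [-1, 1] with the training [min, max].
mn = min(Ztr, [], 1);  mx = max(Ztr, [], 1);
scale = @(Z) 2 * bsxfun(@rdivide, bsxfun(@minus, Z, mn), mx - mn) - 1;
Ztr = scale(Ztr);
Zte = scale(Zte);                         % reuses the training min and max
```

Because PCA here sees the unscaled features, this ordering is most reasonable when the original variables are already on comparable scales.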
