How can I combine two different datasets into one training set for an SVM?

I know that you're supposed to scale your test data using the parameters (mean and standard deviation) from your training data. That part is relatively simple. But what if one training set has too few samples (e.g. Set A = 5 samples), so I want to combine two data sets (i.e. Set A + Set B = 10 samples) to have enough samples for training? How can I scale/normalize the two sets as one and then use those parameters on my test set? If I scale them individually, I will have two means and two standard deviations.

For context, I'm trying to combine microarray expression data from two different microarray platforms, so their expression ranges are different.

Thank you for your help in advance

Topic svm r data-mining machine-learning

Category Data Science


I think what you need is a preprocessing technique such as quantile normalization. You can check the document by Jeff Leek on quantile normalization. In that tutorial he uses R code to normalize two studies from different populations measured on the same genes.
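To make the idea concrete, here is a minimal sketch of quantile normalization in Python with NumPy. It is not the code from the tutorial mentioned above. The function name `quantile_normalize`, the genes-by-samples layout, and the synthetic expression values are all assumptions made for illustration, and ties are broken by position rather than averaged.

```python
import numpy as np

def quantile_normalize(expr):
    """Force every sample (column) to share the same distribution.

    expr: 2-D array, rows = genes, columns = samples (assumed layout).
    """
    expr = np.asarray(expr, dtype=float)
    # Rank of each gene within its own sample (0 = smallest value).
    order = np.argsort(expr, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = np.sort(expr, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]

# Synthetic stand-ins for two platforms with very different ranges.
rng = np.random.default_rng(0)
platform_a = rng.normal(loc=8.0, scale=1.0, size=(100, 5))
platform_b = rng.normal(loc=200.0, scale=50.0, size=(100, 5))

combined = np.hstack([platform_a, platform_b])  # 100 genes x 10 samples
normalized = quantile_normalize(combined)
```

After this step every column of `normalized` contains the same set of values, and only the ordering of genes within each sample differs, so the two platforms end up on a common scale.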


From a proper methodological standpoint, you should do the scaling you're proposing after the two sets are merged.
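As a rough sketch of merging first and scaling afterwards, the Python snippet below stacks the two sets, computes a single mean and standard deviation per feature, and reuses those parameters on the test set. The array names, sizes, and random values are made up for illustration, and samples are assumed to be rows.

```python
import numpy as np

# Synthetic placeholders: 5 samples each, 3 features (rows = samples).
rng = np.random.default_rng(1)
set_a = rng.normal(loc=8.0, scale=1.0, size=(5, 3))
set_b = rng.normal(loc=10.0, scale=2.0, size=(5, 3))
test_set = rng.normal(loc=9.0, scale=1.5, size=(4, 3))

# Merge first, then estimate ONE mean and ONE stdev per feature.
train = np.vstack([set_a, set_b])
train_mean = train.mean(axis=0)
train_std = train.std(axis=0, ddof=1)
train_std[train_std == 0] = 1.0  # guard against constant features

# Apply the same training parameters to both training and test data.
train_scaled = (train - train_mean) / train_std
test_scaled = (test_set - train_mean) / train_std
```

The test set is never used to estimate the parameters, so there is still only one mean and one standard deviation per feature to carry forward.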

Either way, the model will not be able to tell apart, say, an anomalous reading from one generating process that happens to fall within the range produced by the other, unless you also add a variable indicating which source system each observation came from (assuming all else is equal).
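One way to add such a source variable is a 0/1 column appended to the merged feature matrix. The Python sketch below assumes samples are rows; the helper name `add_source_indicator` and the toy data are hypothetical.

```python
import numpy as np

def add_source_indicator(set_a, set_b):
    """Stack two sample sets and append a column: 0 = set A, 1 = set B."""
    set_a = np.asarray(set_a, dtype=float)
    set_b = np.asarray(set_b, dtype=float)
    flag_a = np.zeros((set_a.shape[0], 1))
    flag_b = np.ones((set_b.shape[0], 1))
    return np.vstack([np.hstack([set_a, flag_a]),
                      np.hstack([set_b, flag_b])])

# Toy data: 5 samples per set, 3 features each.
rng = np.random.default_rng(2)
samples_a = rng.normal(size=(5, 3))
samples_b = rng.normal(size=(5, 3))

merged = add_source_indicator(samples_a, samples_b)  # 10 samples x 4 columns
```

Any test sample would need the same extra column, set according to the platform it was measured on.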

You need to make sure that both sets of observations actually represent the same population of possible observations in order to make this modeling decision.
