Pre-process images before training OneClassSVM and decrease the number of features

I want to train a OneClassSVM() using sklearn, and my training set contains around 800 images.

I am using OpenCV to read the images, resize them to constant dimensions (960x540), and add them to a numpy array. The images are RGB, so each one has 3 channels. After reading all the images, I reshape the numpy array:

# Assume X is the numpy array of shape (n_samples, 540, 960, 3)
# holding all the images; now flatten each image into a single row
n_samples = len(X)
X = X.reshape(n_samples, 960*540*3)
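
Roughly, my reading and resizing step looks like this (image_paths stands in for my actual list of file paths):

import cv2
import numpy as np

# cv2.resize takes (width, height); cv2.imread returns BGR arrays
images = [cv2.resize(cv2.imread(p), (960, 540)) for p in image_paths]
X = np.array(images)  # shape: (n_samples, 540, 960, 3)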

As you can see, the number of features is huge (1,555,200 to be exact).

Now I try to train my model:

from sklearn.svm import OneClassSVM

model = OneClassSVM(kernel='rbf', gamma=0.001)
model.fit(X)

After running my code, it crashed with a MemoryError. If I'm not mistaken, this is obviously due to the large number of features? So, is there a better way to pre-process the images before fitting them, or to decrease the number of features?

Topic numpy preprocessing scikit-learn python machine-learning

Category Data Science


One approach is to use an artificial neural network to extract features representing the images. This can be done either by using a pre-configured network with pre-trained weights and extracting the output of one of the hidden layers, or by constructing and training your own network for this purpose.

Using a pre-configured, pre-trained model is easily accomplished with Keras and TensorFlow, where you can import InceptionV3 or MobileNet with weights pre-trained on ImageNet, which would net you 2048 or 1024 features per image, respectively.
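
As a minimal sketch of that idea (not from the original answer; the choice of MobileNet, the 224x224 input size, and the variable names X_raw and features are assumptions):

import numpy as np
from sklearn.svm import OneClassSVM
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input

# pooling='avg' collapses the final feature maps into a single
# 1024-dimensional vector per image
extractor = MobileNet(weights='imagenet', include_top=False, pooling='avg')

# X_raw: array of shape (n_samples, 224, 224, 3) -- images resized to
# MobileNet's default input size rather than 960x540
features = extractor.predict(preprocess_input(X_raw))  # (n_samples, 1024)

model = OneClassSVM(kernel='rbf', gamma=0.001)
model.fit(features)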

An article discussing such an approach can be found here. This could hopefully give you better performance than using something like PCA for dimensionality reduction.


You should try converting the images to principal components using PCA. Please refer to this Analytics Vidhya article on PCA; it should give you a good understanding.

PCA projects the n original features onto p principal components, with p much smaller than n.

The first principal component is a linear combination of the original predictor variables that captures the maximum variance in the data set.

The second principal component is also a linear combination of the original predictor variables; it captures the remaining variance while being uncorrelated with the first. All the succeeding components follow the same concept.

This way you can select the top principal components that explain a good enough share of the cumulative variance in your data.
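
As a minimal sketch (assuming X is the flattened (n_samples, n_features) array from the question; the choice of 200 components is an assumption and should be tuned against the explained variance):

from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

# n_components can be at most n_samples (~800 here); the randomized
# solver avoids forming the full covariance matrix
pca = PCA(n_components=200, svd_solver='randomized')
X_reduced = pca.fit_transform(X)  # shape: (n_samples, 200)

# cumulative share of variance retained by the selected components
print(pca.explained_variance_ratio_.sum())

model = OneClassSVM(kernel='rbf', gamma=0.001)
model.fit(X_reduced)

If X itself is too large to fit in memory, sklearn's IncrementalPCA can compute the decomposition in batches instead.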
