Using PCA as features for production

I'm struggling to figure out how to take PCA into production so that I can test my models on unknown samples.

I'm using both one-hot encoding and TF-IDF to classify my elements with various models, mainly KNN.

I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode new elements so that they match the final feature vector. Since these feature vectors become very large, I use PCA to reduce their dimensionality.

This is an example of how I currently preprocess the features:

import pandas as pd
import texthero as hero
from sklearn.decomposition import PCA

if method == "pca_tfidf":
    # Clean the opcode strings and compute TF-IDF features (texthero)
    df['tfidf'], featurenames = hero.clean(df['OpcodeString']).pipe(hero.tfidf, return_feature_names=True)
    # Expand the per-row TF-IDF vectors into a dense feature matrix
    X = pd.DataFrame(item for item in df['tfidf'])

    # Cap the number of components at the number of TF-IDF features
    if maxFeatures > X.shape[1]:
        maxFeatures = X.shape[1]

    y = df['class']               # split label from data
    ActualY = df['actual class']

    Final_PCA = PCA(n_components=maxFeatures, random_state=42)
    Final_PCA.fit(X)
    X_pca = Final_PCA.transform(X)

    total_var = Final_PCA.explained_variance_ratio_.sum() * 100
    printPCA(X_pca, total_var)

    X = pd.DataFrame(X_pca)

Now what I don't understand is how to proceed with new unknown samples. Since PCA is calculated from every element in the training dataset, would I have to recalculate every component from scratch when new samples arrive? Wouldn't that invalidate the values the model was trained on? Can I just call Final_PCA.fit(X_new) with the TF-IDF values of the unknown samples to produce a feature vector that the model can then classify?

Additionally, I'm asking myself whether my approach is good at all. I have read a few times that PCA is not great for categorical feature values. However, I was able to achieve great results with my approach when I calculate the TF-IDF and PCA first and then do the train/test split afterwards.

Thanks for any help!

Topic feature-reduction pca scikit-learn feature-selection

Category Data Science


In scikit-learn, PCA has the fit_transform method which fits and applies the dimensionality reduction to the training data. There is also transform which only applies the dimensionality reduction. With new unknown samples, use transform.
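A minimal sketch of that distinction, assuming X_train is the dense TF-IDF matrix of the training set and X_new holds new samples encoded with the same, already-fitted TF-IDF vectorizer (the names and n_components here are illustrative):

    from sklearn.decomposition import PCA

    # Fit the PCA on the training features only and project them
    pca = PCA(n_components=100, random_state=42)
    X_train_pca = pca.fit_transform(X_train)

    # For new unknown samples, only project with the already-fitted PCA;
    # never call fit() again, otherwise the components no longer match
    # the feature space the model was trained in.
    X_new_pca = pca.transform(X_new)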


I think you made a mistake here:

I was able to achieve great results with my approach when I calculate the TF-IDF and PCA first and then do the train/test split afterwards.

This evaluation was flawed due to data leakage: both the TF-IDF and the PCA should be calculated using only the training set.

This is more or less the source of your issue with how to proceed in production: normally, production should just apply the same process as your test set, but here you were able to prepare your test set differently (and incorrectly).

  • For the TF-IDF step, the IDF should be computed using only the training set; note that in the test set you might have to deal with out-of-vocabulary words. A leakage-free pipeline sketch follows this list.
  • For the PCA step, I'm not too sure how to proceed, but I found this explanation.
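To make this concrete, below is a minimal leakage-free sketch using a scikit-learn Pipeline. The column names ('OpcodeString', 'class') and the hyperparameters are assumptions taken from the code above, and TruncatedSVD is used in place of PCA because it works directly on the sparse TF-IDF matrix (it is essentially PCA without mean-centering):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    texts, labels = df['OpcodeString'], df['class']

    # Split FIRST, so that nothing is fitted on the test data
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)

    model = Pipeline([
        ('tfidf', TfidfVectorizer()),                               # IDF learned from training texts only
        ('svd', TruncatedSVD(n_components=100, random_state=42)),   # reduction fitted on the training split only
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ])

    model.fit(X_train, y_train)          # fit everything on the training split
    print(model.score(X_test, y_test))   # test texts are only transformed, never fitted

    # In production, new unknown samples go through exactly the same fitted pipeline:
    # predictions = model.predict(new_opcode_strings)

Because the vectorizer and the dimensionality reduction are fitted inside the pipeline on the training split only, applying the pipeline to unknown samples in production uses exactly the same transformations the model was trained with.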
