Using PCA as features for production

I'm struggling to figure out how to take PCA into production so that I can test my models on unknown samples.

I'm using both one-hot encoding and TF-IDF to classify my elements with various models, mainly KNN.

I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode new elements so that they match the final feature vector. Since these feature vectors become very large, I use PCA to reduce their dimensionality.

This is an example of how I currently preprocess the features:

import pandas as pd
import texthero as hero
from sklearn.decomposition import PCA

if method == "pca_tfidf":
    # Clean the opcode strings and compute TF-IDF features (texthero)
    df['tfidf'], featurenames = hero.clean(df['OpcodeString']).pipe(hero.tfidf, return_feature_names=True)
    # Expand the per-row TF-IDF vectors into a dense feature matrix
    X = pd.DataFrame(item for item in df['tfidf'])

    # Cap the number of components at the number of TF-IDF features
    if maxFeatures > X.shape[1]:
        maxFeatures = X.shape[1]

    y = df['class']               # split label from data
    ActualY = df['actual class']

    Final_PCA = PCA(n_components=maxFeatures, random_state=42)
    Final_PCA.fit(X)
    X_pca = Final_PCA.transform(X)

    total_var = Final_PCA.explained_variance_ratio_.sum() * 100
    printPCA(X_pca, total_var)

    X = pd.DataFrame(X_pca)

Now what I don't understand is how to proceed with new unknown samples. Since PCA is calculated from every element in the training dataset, would I have to recalculate every component from scratch when new samples arrive? Wouldn't that invalidate the values the model was trained on? Can I just call Final_PCA.fit(X_new) with the TF-IDF values of the unknown samples to produce a feature vector that the model can then classify?

Additionally, I'm asking myself whether my approach is good at all. I have read a few times that PCA is not great for categorical feature values. However, I was able to achieve great results with my approach when I calculate the TF-IDF and PCA first and then do the train/test split afterwards.

Thanks for any help!

Topic feature-reduction pca scikit-learn feature-selection

Category Data Science


In scikit-learn, PCA has the fit_transform method which fits and applies the dimensionality reduction to the training data. There is also transform which only applies the dimensionality reduction. With new unknown samples, use transform.
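A minimal sketch of that distinction, assuming X_train is the dense TF-IDF matrix of the training set and X_new holds new samples encoded with the same, already-fitted TF-IDF vectorizer (the names and n_components here are illustrative):

    from sklearn.decomposition import PCA

    # Fit the PCA on the training features only and project them
    pca = PCA(n_components=100, random_state=42)
    X_train_pca = pca.fit_transform(X_train)

    # For new unknown samples, only project with the already-fitted PCA;
    # never call fit() again, otherwise the components no longer match
    # the feature space the model was trained in.
    X_new_pca = pca.transform(X_new)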


I think you made a mistake here:

I was able to achieve great results with my approach when I calculate the TF-IDF and PCA first and then do the train/test split afterwards.

This evaluation was flawed due to data leakage: both the TF-IDF and the PCA should be calculated using only the training set.

This is more or less the source of your issue with how to proceed in production: normally, production should just apply the same process as your test set, but here you were able to prepare your test set differently (and incorrectly).

  • For the TF-IDF step, the IDF should be computed using only the training set; note that in the test set you might have to deal with out-of-vocabulary words. A leakage-free pipeline sketch follows this list.
  • For the PCA step, I'm not too sure how to proceed, but I found this explanation.
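To make this concrete, below is a minimal leakage-free sketch using a scikit-learn Pipeline. The column names ('OpcodeString', 'class') and the hyperparameters are assumptions taken from the code above, and TruncatedSVD is used in place of PCA because it works directly on the sparse TF-IDF matrix (it is essentially PCA without mean-centering):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    texts, labels = df['OpcodeString'], df['class']

    # Split FIRST, so that nothing is fitted on the test data
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)

    model = Pipeline([
        ('tfidf', TfidfVectorizer()),                               # IDF learned from training texts only
        ('svd', TruncatedSVD(n_components=100, random_state=42)),   # reduction fitted on the training split only
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ])

    model.fit(X_train, y_train)          # fit everything on the training split
    print(model.score(X_test, y_test))   # test texts are only transformed, never fitted

    # In production, new unknown samples go through exactly the same fitted pipeline:
    # predictions = model.predict(new_opcode_strings)

Because the vectorizer and the dimensionality reduction are fitted inside the pipeline on the training split only, applying the pipeline to unknown samples in production uses exactly the same transformations the model was trained with.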
