Using PCA as features for production
I'm struggling to figure out how to take PCA into production so that I can test my models on unknown samples.
I'm using both one-hot encoding and TF-IDF to classify my elements with various models, mainly KNN.
I know I can use the already fitted one-hot encoder and TF-IDF encoder to encode new elements so that they match the final feature vector. Since these feature vectors become very large, I use PCA to reduce their dimension.
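To make the encoding step concrete, here is a minimal sketch of reusing fitted encoders on a new element. It uses scikit-learn's `OneHotEncoder` and `TfidfVectorizer` as stand-ins for my actual encoders; the `FileType` column and all sample values are made up for illustration, and only `OpcodeString` is a real column name.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy training data; "FileType" is a hypothetical categorical column.
train = pd.DataFrame({
    "FileType": ["exe", "dll", "exe"],
    "OpcodeString": ["push mov call", "mov add ret", "push call ret"],
})
# A new element with a category and an opcode not seen during training.
new = pd.DataFrame({
    "FileType": ["sys"],
    "OpcodeString": ["mov call jmp"],
})

# Fit both encoders on the training data only.
onehot = OneHotEncoder(handle_unknown="ignore")  # unseen categories become all zeros
tfidf = TfidfVectorizer()                        # unseen tokens are dropped
X_train = hstack([
    onehot.fit_transform(train[["FileType"]]),
    tfidf.fit_transform(train["OpcodeString"]),
])

# Encode the new element with transform() only, so the columns line up.
X_new = hstack([
    onehot.transform(new[["FileType"]]),
    tfidf.transform(new["OpcodeString"]),
])

print(X_train.shape, X_new.shape)  # same number of columns in both
```

The point of `handle_unknown="ignore"` is that a category missing from the training data does not raise an error at prediction time; whether silently zeroing it out is acceptable depends on the data.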
This is an example of how I currently preprocess the features:
import pandas as pd
import texthero as hero
from sklearn.decomposition import PCA

if method == 'pca_tfidf':
    # Clean the opcode strings and compute one TF-IDF vector per row
    df['tfidf'], featurenames = hero.clean(df['OpcodeString']).pipe(hero.tfidf, return_feature_names=True)
    X = df['tfidf']
    X = pd.DataFrame(item for item in X)
    # PCA cannot return more components than there are features
    if maxFeatures > X.shape[1]:
        maxFeatures = X.shape[1]
    y = df['%n class']  # Split label from data
    ActualY = df['actual class']
    Final_PCA = PCA(n_components=maxFeatures, random_state=42)
    Final_PCA.fit(X)
    X_pca = Final_PCA.transform(X)
    total_var = Final_PCA.explained_variance_ratio_.sum() * 100  # explained variance in percent
    printPCA(X_pca, total_var)
    X = pd.DataFrame(item for item in X_pca)
Now what I don't understand is how to proceed with new, unknown samples. Since PCA is computed relative to every element in the training dataset, would I have to recompute every component using all the new samples? Wouldn't that change the values the model was trained on? Can I just use Final_PCA.fit(X_new) with the TF-IDF values of the unknown samples to produce a feature vector that the model can then classify?
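For comparison, this is a minimal sketch of the alternative to refitting: fit the vectorizer and the PCA once on the training data, then call only `transform` on unknown samples, so they are projected onto the same components the model was trained on. It assumes scikit-learn's `TfidfVectorizer` in place of texthero, and the opcode strings are made up.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up opcode strings standing in for the real training data.
train_docs = [
    "push mov call ret",
    "mov add jmp ret",
    "push push call pop",
    "add sub mov jmp",
]
new_docs = ["mov call ret unknownop"]  # unknown sample with an unseen token

# Fit TF-IDF and PCA on the training data only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs).toarray()
pca = PCA(n_components=2, random_state=42)
X_train_pca = pca.fit_transform(X_train)

# For unknown samples, only transform(): the fitted vocabulary and the
# fitted principal components are reused, nothing is recomputed.
X_new = vectorizer.transform(new_docs).toarray()
X_new_pca = pca.transform(X_new)

print(X_train_pca.shape, X_new_pca.shape)
```

Calling `fit` on the new samples instead would replace the components learned from the training data, so the resulting features would no longer live in the space the classifier was trained in.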
Additionally, I'm wondering whether my approach is good at all. I have read a few times that PCA is not well suited to categorical feature values. However, I was able to achieve great results with my approach when I compute the TF-IDF and PCA first and do the train/test split afterwards.
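Fitting TF-IDF and PCA before the split lets the test rows influence the vocabulary, the IDF weights and the components, which may be part of why the results look so good. A minimal sketch of the other order (split first, then fit everything on the training part only) using a scikit-learn `Pipeline` follows; the documents, labels, `n_components` and `n_neighbors` are illustrative assumptions, not my real settings.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Made-up opcode strings with two made-up classes.
docs = [
    "push mov call ret", "push call ret pop", "mov call ret push", "call ret push mov",
    "add sub mul div", "sub mul add xor", "mul div xor add", "xor add sub div",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split first, so the test rows never influence TF-IDF or PCA.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=42
)

def to_dense(X):
    """Convert the sparse TF-IDF matrix to a dense array for PCA."""
    return X.toarray()

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("pca", PCA(n_components=2, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# fit() fits every step on the training data; predict() only transforms.
model.fit(docs_train, y_train)
predictions = model.predict(docs_test)
print(len(predictions), model.score(docs_test, y_test))
```

The same fitted pipeline object can then be applied to unknown samples in production with `model.predict(new_docs)`.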
Thanks for any help!
Topic feature-reduction pca scikit-learn feature-selection
Category Data Science