Using PCA for Dimensionality Expansion

I was trying to use the t-SNE algorithm for dimensionality reduction, and I know this is not the primary usage of the algorithm and is not recommended. I saw an implementation here, but I am not convinced by it.

The algorithm works like this:

  1. Given a training dataset and a test dataset, combine the 2 together into one full dataset
  2. Run t-SNE on the full dataset (excluding the target variable)
  3. Take the output of the t-SNE and add it as K new columns to the full dataset, K being the mapping dimensionality of t-SNE.
  4. Re-split the full dataset into training and test
  5. Split the training dataset into N folds
  6. Train your machine learning model on the N folds, using N-fold cross-validation
  7. Evaluate the machine learning model on the test dataset
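The steps above can be sketched as follows. This is a hypothetical illustration with synthetic data and scikit-learn's TSNE; note that step 2 fits t-SNE on the training and test rows together, which is exactly the leakage discussed in the answers.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for the train and test sets (features only).
X_train = rng.normal(size=(40, 10))
X_test = rng.normal(size=(20, 10))

# Step 1: combine train and test into one full dataset.
X_full = np.vstack([X_train, X_test])

# Step 2: run t-SNE on the full dataset (target excluded).
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X_full)       # shape (60, 2)

# Step 3: append the K = 2 embedding columns to the features.
X_full_aug = np.hstack([X_full, embedding])  # shape (60, 12)

# Step 4: re-split into train and test along the original boundary.
X_train_aug = X_full_aug[: len(X_train)]
X_test_aug = X_full_aug[len(X_train):]
```

Steps 5-7 would then proceed with `X_train_aug` and `X_test_aug` as usual.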

My main questions are not about t-SNE itself, but:

  • Can I use this algorithm with other dimensionality reduction algorithms, such as PCA, by splitting the dataset into train and test sets before transforming the data?
  • Would this be effective?

Dimensionality is not a problem for my dataset because it is already a small one. Having highly correlated features is also not a concern.

Topic: pca, dimensionality-reduction

Category: Data Science


Your algorithm may work only if the embeddings created by the manifold learning (t-SNE) capture information that the features by themselves do not.

As mentioned in the comments, t-SNE has no transform method for unseen data, so you would have to fit and embed the training and test data together, leading to leakage. An alternative would be UMAP, which can transform new data, so your approach would be:

For each fold $K$ in number_of_folds:

  1. Fit UMAP on the $K-1$ training folds, excluding the target variable.

  2. Take the output of the UMAP and add it as $U$ new columns to both the training folds and the held-out fold, $U$ being the mapping dimensionality of UMAP.

  3. Train your machine learning model on the $K-1$ folds.

  4. Evaluate the machine learning model on the $K$-th fold.
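A minimal sketch of that per-fold loop. Since UMAP may not be installed everywhere, scikit-learn's PCA stands in for the reducer here; both expose the same `fit`/`transform` interface, so `umap.UMAP(n_components=U)` could be swapped in directly. The data and model are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA  # stand-in for umap.UMAP (same fit/transform API)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

U = 2  # mapping dimensionality of the reducer
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # 1. Fit the reducer on the training folds only (target excluded).
    reducer = PCA(n_components=U).fit(X[train_idx])
    # 2. Append the U new columns to both partitions.
    X_tr = np.hstack([X[train_idx], reducer.transform(X[train_idx])])
    X_val = np.hstack([X[val_idx], reducer.transform(X[val_idx])])
    # 3. Train the model on the K-1 training folds.
    model = LogisticRegression().fit(X_tr, y[train_idx])
    # 4. Evaluate on the held-out fold.
    scores.append(model.score(X_val, y[val_idx]))

mean_score = float(np.mean(scores))
```

Because the reducer is refit inside each fold, the held-out fold never influences the embedding it is evaluated with.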


This approach strikes me as bad practice, since it immediately torpedoes the hope of having any independent test data in which to have an unbiased measure of algorithm performance. By combining all the data and performing t-SNE, you are generating t-SNE features that have explanatory power in both the training and the testing data. No matter how you split the data after that, there can be no truly independent test data, since all of the data was used to define the features in the first place.

I'm not at all surprised to see an apparent improvement in performance statistics from this approach, since it is a biased method that has "peeked" at the test data, and will likely be overoptimistic. You should never perform feature selection/dimensionality reduction before splitting into train/test datasets, or else you are contaminating the process with the test data which should only be used at the very end after the model is built. Using the test data for anything other than testing (in this case, dimensionality reduction) will introduce bias into your evaluation.
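One way to make this ordering structurally hard to get wrong (a sketch with scikit-learn and synthetic data; any fit/transform reducer works the same way) is to hold out the test set first and put the reduction inside a Pipeline, so cross-validation refits it on each fold's training portion only:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
y = (X[:, 0] - X[:, 2] > 0).astype(int)  # synthetic binary target

# Hold the test set out before any fitting or transforming.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The reducer lives inside the pipeline, so every CV fold refits it
# on that fold's training portion only; the test data is never seen.
pipe = make_pipeline(PCA(n_components=2), LogisticRegression())
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Fit on the full training set, then evaluate once on the untouched test set.
test_score = pipe.fit(X_train, y_train).score(X_test, y_test)
```

The test set contributes nothing to the fitted transform, so the final score is an unbiased estimate in the sense described above.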
