How to load and run feature selection on a dataset with 5,000 samples and 500,000 features?

I have a dataset with 5000 samples and 500,000 features (all categorical with a cardinality of 3).

Two problems I'm trying to solve:

  1. Loading the dataset - I can't load it into memory despite using a computing cluster, so I'm assuming I should use a parallelization library like Dask, Spark, or Vaex. Is this the best idea?
  2. Feature selection - how do I run feature selection within a parallelization library? Can this be done with Dask, Spark, or Vaex?



For the first part, I would guess that your matrix is sparse. If so, you can convert it to a sparse format and then read it into memory.
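As a rough sketch of that idea, the snippet below first works out the dense footprint of a 5,000 x 500,000 matrix for two dtypes, then converts a small stand-in matrix to SciPy's CSR format. It assumes the three categories are coded as the integers 0, 1 and 2 and that 0 is by far the most common value; the question states neither, and the stand-in data and its 90/7/3 split are made up for illustration.

```python
import numpy as np
from scipy import sparse

n_samples, n_features = 5000, 500_000

# Dense footprint for two candidate dtypes (pure arithmetic, nothing is allocated).
bytes_float64 = n_samples * n_features * np.dtype(np.float64).itemsize
bytes_int8 = n_samples * n_features * np.dtype(np.int8).itemsize
print(f"{bytes_float64 / 1e9:.1f} GB as float64")  # 20.0 GB as float64
print(f"{bytes_int8 / 1e9:.1f} GB as int8")        # 2.5 GB as int8

# Small stand-in for the real data (assumption: values 0/1/2, mostly 0).
rng = np.random.default_rng(0)
dense = rng.choice([0, 1, 2], size=(200, 1000), p=[0.90, 0.07, 0.03]).astype(np.int8)

# CSR keeps only the non-zero entries plus two index arrays.
X = sparse.csr_matrix(dense)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense: {dense.nbytes} bytes, CSR: {sparse_bytes} bytes")
```

Note that with int8 values and 32-bit column indices each stored entry costs about 5 bytes, so CSR only beats a dense int8 array when fewer than roughly one entry in five is non-zero. If the three categories are evenly spread, a compact dense dtype is probably the better option.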

For the second part, it depends on how sparse your matrix is and how many features you want to select. One approach is to keep the top n most variable features, run PCA on them, and keep the top m principal components (PCs). Both n and m depend on the sparsity of your matrix: n can be anywhere from 5,000 to 50,000, and you can choose m by plotting the variance explained by each PC and finding the inflection point.
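The approach above could look something like this with NumPy and scikit-learn. This is only a sketch on a small random stand-in matrix; `n_top` and `m_max` are placeholder values chosen for the toy data, not recommendations, and the 0/1/2 coding is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Small stand-in: 300 samples, 2000 features coded 0/1/2 (assumed encoding).
X = rng.integers(0, 3, size=(300, 2000)).astype(np.float32)

n_top = 500  # "n": how many of the most variable features to keep (placeholder)
m_max = 50   # upper bound on the number of PCs to inspect (placeholder)

# Rank features by variance and keep the n_top most variable ones.
variances = X.var(axis=0)
top_idx = np.argsort(variances)[::-1][:n_top]
X_top = X[:, top_idx]

# Fit PCA on the reduced matrix.
pca = PCA(n_components=m_max, random_state=0)
scores = pca.fit_transform(X_top)

# Variance explained by each PC, largest first. Plotting this curve
# (for example with matplotlib) and looking for the bend gives "m".
evr = pca.explained_variance_ratio_
print(evr[:5])
```

On random data like this the curve is nearly flat, so there is no clear bend; on real data with structure the first few PCs should stand out.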


5,000 samples and 500,000 features is not that big; it all depends on how much memory you have. Also remember that you can always optimize your data format. For example, if the values are stored as float64, do they need to be? If they are categorical, how are they encoded (as one character, or as a 20-character word)? So yes, if you can load the data into memory, good for you. If not, here are some suggestions:

  1. If you only have 5K samples, you should not use all of them for feature selection; keep some aside to validate whatever you select.
  2. You can drop features that have very low variance. In the extreme case, a column whose variance is 0 is certainly useless.
  3. There is a technique called feature screening, proposed by Fan et al. from Princeton (https://orfe.princeton.edu/abstracts/feature-screening-distance-correlation-learning). In short: you lower the dimension with a univariate model first, and afterwards use multivariate feature selection models.
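The three suggestions can be chained roughly as follows with scikit-learn. Treat this as a sketch under several assumptions: the question never mentions a target, so a hypothetical binary label `y` is invented here; the univariate step uses a chi-squared score as a simple stand-in, not the distance-correlation screening from the linked abstract; and the L1-penalised linear SVM is just one possible multivariate selector. All sizes and thresholds are placeholders for the toy data.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Toy stand-in: 400 samples, 3000 features coded 0/1/2 (assumed encoding).
X = rng.integers(0, 3, size=(400, 3000)).astype(np.int8)
X[:, :10] = 1  # ten constant (zero-variance) columns
# Hypothetical binary target driven by features 10 and 11 plus noise.
y = (X[:, 10] + X[:, 11] + rng.integers(0, 2, size=400) > 2).astype(int)

# Suggestion 1: select features on a training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

selector = make_pipeline(
    VarianceThreshold(threshold=0.0),  # suggestion 2: drop constant columns
    SelectKBest(chi2, k=200),          # suggestion 3a: univariate screening
    SelectFromModel(                   # suggestion 3b: multivariate selection
        LinearSVC(penalty="l1", dual=False, C=0.02, random_state=0)
    ),
)
selector.fit(X_train, y_train)

# Map each stage's boolean mask back to original column indices.
kept = np.arange(X.shape[1])
stages = {}
for name, step in selector.steps:
    kept = kept[step.get_support()]
    stages[name] = kept

print({name: len(idx) for name, idx in stages.items()})
```

The held-out `X_test` and `y_test` are never seen by the selector, so they can be used afterwards to check whether a model built on the selected columns still performs well.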
