How do I load and run feature selection on a dataset with 5,000 samples and 500,000 features?
I have a dataset with 5,000 samples and 500,000 features (all categorical, each with a cardinality of 3).
Two problems I'm trying to solve:
- Loading the dataset: I can't load it into memory, even on a computing cluster, so I assume I need an out-of-core or parallel library such as Dask, Spark, or Vaex. Is that the right approach?
- Feature selection: how do I run feature selection inside such a library? Can it be done with Dask, Spark, or Vaex?
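For scale: 5,000 × 500,000 is 2.5 billion values, which is roughly 2.5 GB at one byte per value (and about 20 GB if each value ends up as a 64-bit integer). To make the second problem concrete, here is a minimal sketch of block-wise univariate selection in plain NumPy. It rests on assumptions that are not stated above: that there is a class label `y` to select against, that the features are encoded as integers 0, 1, 2, and that the matrix can be exposed as an array-like that supports column slicing (for example an `np.memmap` over a raw `uint8` file). The function name `chi2_scores_by_block` and the synthetic demo data are hypothetical, and the chi-squared test is only one possible selection criterion.

```python
import numpy as np

def chi2_scores_by_block(X, y, n_levels=3, block_size=10_000):
    """Chi-squared score of each categorical column against class labels y.

    X: 2-D integer array (in-memory or np.memmap), values in 0..n_levels-1.
    Only `block_size` columns are materialized in memory at a time.
    """
    y = np.asarray(y)
    n_samples, n_features = X.shape
    class_masks = [y == c for c in np.unique(y)]
    class_counts = [float(m.sum()) for m in class_masks]
    scores = np.empty(n_features, dtype=np.float64)

    for start in range(0, n_features, block_size):
        # Pull one block of columns into memory.
        block = np.asarray(X[:, start:start + block_size])
        chi2 = np.zeros(block.shape[1], dtype=np.float64)
        for level in range(n_levels):
            is_level = block == level
            level_counts = is_level.sum(axis=0).astype(np.float64)
            for mask, n_c in zip(class_masks, class_counts):
                observed = is_level[mask].sum(axis=0)
                expected = level_counts * n_c / n_samples
                with np.errstate(divide="ignore", invalid="ignore"):
                    term = (observed - expected) ** 2 / expected
                # Cells with zero expected count contribute nothing.
                chi2 += np.where(expected > 0, term, 0.0)
        scores[start:start + block.shape[1]] = chi2
    return scores

# Small synthetic stand-in for the real data (assumption: labels exist).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 50), dtype=np.uint8)
y_demo = (X_demo[:, 7] == 2).astype(int)  # column 7 fully determines y

scores = chi2_scores_by_block(X_demo, y_demo, block_size=7)
top = np.argsort(scores)[::-1][:5]  # indices of the 5 highest-scoring columns
```

On the real data, `X` could be something like `np.memmap(path, dtype=np.uint8, mode="r", shape=(5000, 500000))`, where `path` is a hypothetical raw binary file. Because each column is scored independently, the same per-block loop could in principle be handed to Dask or Spark workers, though I have not verified that this is the best fit for those libraries.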
Topic parallel machine-learning
Category Data Science