Guidance needed with dimension reduction for clustering - some numerical, lots of categorical data

I have my data in a pandas DataFrame with 25,000 rows and 1,500 columns, without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are columns with binary values that originated from columns with categorical data (I used pd.get_dummies() for this).

Now I'd like to reduce the dimensionality. I've been running

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(df)

for three hours now, and I'm asking myself whether my approach is correct. I also saw two variants of PCA, one of which is for sparse data. Does that mean it doesn't make sense to run plain PCA in such a mixed scenario?

Since I've been busy with cleaning and transforming my data until now, I'd like to understand what a good strategy would be for eliminating irrelevant columns.

I'd appreciate some hints to move forward.

Topic pca scikit-learn pandas python dimensionality-reduction

Category Data Science


There are many ways to get rid of redundant dimensions. Whether to do it, and how, depends on what kind of problem you want to solve and what kind of algorithm you plan to use.
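For data like yours, where most columns are one-hot dummies, one common approach is to store the matrix in a sparse format, drop near-constant dummy columns with VarianceThreshold, and then use TruncatedSVD (the scikit-learn variant of PCA that accepts sparse input without centering). A minimal sketch, using randomly generated stand-in data with the shapes you described; the density, threshold, and shapes are illustrative assumptions, not your actual data:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Illustrative stand-in for the real data: 25,000 rows,
# 30 scaled numeric columns plus ~1,470 sparse dummy columns.
numeric = rng.standard_normal((25_000, 30))
dummies = sparse.random(25_000, 1_470, density=0.01,
                        format="csr", random_state=0)

# Keep everything in one sparse matrix instead of a dense DataFrame.
X = sparse.hstack([sparse.csr_matrix(numeric), dummies], format="csr")

# Drop dummy columns that are almost always 0 (or almost always 1);
# the threshold here is an arbitrary example value.
selector = VarianceThreshold(threshold=0.001)
X_filtered = selector.fit_transform(X)

# TruncatedSVD works directly on sparse matrices and is the usual
# substitute for PCA in this setting.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_filtered)

print(X_reduced.shape)  # (25000, 2)
```

On a matrix of this size the whole pipeline finishes in seconds rather than hours, because the dummy columns never get densified.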
