Interpreting the results of randomized PCA in scikit-learn

I'm using scikit-learn to do a genome-wide association study with a feature vector of about 100K SNPs. My goal is to tell the biologists which SNPs are "interesting".

RandomizedPCA really improved my models, but I'm having trouble interpreting the results. Can scikit-learn tell me which features are used in each component?

Tags: randomized-algorithms, pca, scikit-learn, feature-selection


Yes, through the components_ property:

import numpy, seaborn, pandas, sklearn.decomposition
# 1000 samples of three correlated variables.
data = numpy.random.randn(1000, 3) @ numpy.random.randn(3, 3)
# Pairwise scatter plots of the columns.
seaborn.pairplot(pandas.DataFrame(data, columns=['x', 'y', 'z']));

[Figure: faceted scatter plot produced by seaborn.pairplot]

sklearn.decomposition.RandomizedPCA().fit(data).components_

array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243],
       [ 0.71047094, -0.05037554, -0.70192119]])

sklearn.decomposition.RandomizedPCA(2).fit(data).components_

array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243]])

We see that the truncated decomposition is simply a truncation of the full decomposition. Each row of components_ holds the coefficients (loadings) of the corresponding principal component on the original features, so the features with the largest absolute coefficients are the ones contributing most to that component.
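To turn those loadings into a list of "interesting" features, one common approach is to rank the original features by the absolute value of their coefficients in the leading components. Here is a minimal sketch, assuming a hypothetical SNP matrix X of shape (n_samples, n_snps) and a matching list snp_names (both stand-ins for your real data); note that in scikit-learn 0.18+ RandomizedPCA has been folded into PCA via svd_solver='randomized':

import numpy
from sklearn.decomposition import PCA

# Hypothetical stand-ins for the real genotype data.
X = numpy.random.randn(500, 100)                        # (n_samples, n_snps)
snp_names = ['snp_%d' % i for i in range(X.shape[1])]

# RandomizedPCA is now PCA with the randomized SVD solver.
pca = PCA(n_components=2, svd_solver='randomized').fit(X)

# components_ is (n_components, n_features); rank features by |loading|
# on the first principal component.
order = numpy.argsort(numpy.abs(pca.components_[0]))[::-1]
print([(snp_names[i], pca.components_[0, i]) for i in order[:10]])

Checking pca.explained_variance_ratio_ alongside this tells you how much of the variance each component actually accounts for, so you can focus the ranking on the components that matter.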
