Interpreting the results of randomized PCA in scikit-learn

I'm using scikit-learn to do a genome-wide association study with a feature vector of about 100K SNPs. My goal is to tell the biologists which SNPs are "interesting".

RandomizedPCA really improved my models, but I'm having trouble interpreting the results. Can scikit-learn tell me which features are used in each component?

Tags: randomized-algorithms, pca, scikit-learn, feature-selection


Yes, through the components_ property:

import numpy, seaborn, pandas, sklearn.decomposition
# 1000 samples of three correlated variables.
data = numpy.random.randn(1000, 3) @ numpy.random.randn(3, 3)
# Pairwise scatter plots of the columns.
seaborn.pairplot(pandas.DataFrame(data, columns=['x', 'y', 'z']));

[Figure: faceted scatter plot produced by seaborn.pairplot]

sklearn.decomposition.RandomizedPCA().fit(data).components_

array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243],
       [ 0.71047094, -0.05037554, -0.70192119]])

sklearn.decomposition.RandomizedPCA(2).fit(data).components_

array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243]])

We see that the truncated decomposition is simply a truncation of the full decomposition. Each row of components_ holds the coefficients (loadings) of the corresponding principal component on the original features, so the features with the largest absolute coefficients are the ones contributing most to that component.
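To turn those loadings into a list of "interesting" features, one common approach is to rank the original features by the absolute value of their coefficients in the leading components. Here is a minimal sketch, assuming a hypothetical SNP matrix X of shape (n_samples, n_snps) and a matching list snp_names (both stand-ins for your real data); note that in scikit-learn 0.18+ RandomizedPCA has been folded into PCA via svd_solver='randomized':

import numpy
from sklearn.decomposition import PCA

# Hypothetical stand-ins for the real genotype data.
X = numpy.random.randn(500, 100)                        # (n_samples, n_snps)
snp_names = ['snp_%d' % i for i in range(X.shape[1])]

# RandomizedPCA is now PCA with the randomized SVD solver.
pca = PCA(n_components=2, svd_solver='randomized').fit(X)

# components_ is (n_components, n_features); rank features by |loading|
# on the first principal component.
order = numpy.argsort(numpy.abs(pca.components_[0]))[::-1]
print([(snp_names[i], pca.components_[0, i]) for i in order[:10]])

Checking pca.explained_variance_ratio_ alongside this tells you how much of the variance each component actually accounts for, so you can focus the ranking on the components that matter.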
