How to interpret scikit-learn PCA components_ output
I am trying to use PCA with scikit-learn for feature selection, and there is something about PCA that I am not understanding. Can someone please fill in the blanks for me? I have a normalised dataset with 11 features. The output of PCA is:
==================================================
Explained Variance:
[0.29673715 0.15425831 0.10136684 0.09121094 0.09012841 0.08089791
0.07294822 0.04842635 0.0290573 0.0249145 0.01005407]
==================================================
Cumulative Explained Variance:
[ 29.67371513 45.09954607 55.23623054 64.35732446 73.37016542
81.45995668 88.75477905 93.5974138 96.50314332 98.9945929
100. ]
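For reference, the numbers above were printed with something along these lines (a rough sketch, not my exact script; Xnorm is my normalised data):
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()                                             # keep all 11 components for now
pca.fit(Xnorm)                                          # Xnorm: normalised data, shape (n_samples, 11)
print("Explained Variance:")
print(pca.explained_variance_ratio_)                    # fraction of variance per component
print("Cumulative Explained Variance:")
print(np.cumsum(pca.explained_variance_ratio_) * 100)   # running total, in percent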
==================================================
The graph of the cumulative variance is:
I would like to select the best features for capturing at least 90% of the variance. From the cumulative variance above, it looks like I need 8 features (93.6% is reached at the 8th component).
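If it helps, I believe scikit-learn can also pick this cut-off for me by passing a fraction to n_components (just a sketch of what I mean, not the code I actually ran):
from sklearn.decomposition import PCA

pca_90 = PCA(n_components=0.90, svd_solver="full")    # keep just enough components for >= 90% variance
X_90 = pca_90.fit_transform(Xnorm)                    # Xnorm: my normalised data
print(pca_90.n_components_)                           # 8, going by the cumulative variance above
print(X_90.shape)                                     # (n_samples, 8)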
Now - this is where I am getting stuck in my understanding... The output of:
from sklearn.decomposition import PCA

n_components = min(train_X.shape[0], train_X.shape[1])   # cap at min(samples, features)
print("Number of components: {}".format(n_components))
pca = PCA(n_components)
Xnew = pca.fit_transform(Xnorm)                           # project the normalised data onto the components
print(pca.components_)
is:
[[-4.54963949e-01 3.52251748e-01 1.10835672e-01 3.24987376e-02
1.50839952e-01 4.89800984e-01 5.07587627e-01 -2.55056877e-01
-1.51085428e-01 2.17680369e-01 4.16434152e-03]
[ 3.11238184e-02 3.16571309e-01 3.71425795e-01 1.16885130e-01
2.34244968e-01 1.94162813e-01 6.93267053e-05 5.73375241e-01
1.55866200e-02 -5.70002237e-01 -2.68345326e-02]
[ 8.93789369e-02 5.78086627e-02 -1.96617492e-01 7.88298969e-01
5.18661364e-01 -1.27086158e-01 -9.17946249e-02 -9.49203239e-02
5.79563898e-02 1.44485999e-01 4.20010936e-02]
[-1.58722853e-01 -2.27506222e-02 -3.00837400e-01 -1.15388113e-02
1.43405517e-02 7.48094459e-02 6.02387058e-02 6.00762155e-02
4.25343554e-01 -7.73269870e-02 -8.26871586e-01]
[-2.09948047e-01 -8.48143267e-03 -4.01247348e-01 -8.14759788e-02
9.18873965e-04 1.13669361e-01 1.09267168e-01 1.16682505e-01
6.44328962e-01 -1.46582716e-01 5.59544565e-01]
[ 1.60325194e-01 2.15310361e-01 5.76206458e-01 2.45075331e-01
-3.75232499e-01 -1.69211786e-03 -6.07564983e-02 -1.84233535e-01
5.51880866e-01 2.29594827e-01 -1.18138407e-02]
[ 1.58542278e-01 -1.33956129e-01 2.96423766e-01 -4.82608786e-01
7.10257199e-01 -3.10976685e-02 -4.38346290e-02 -9.38340797e-02
2.64559513e-01 2.23200607e-01 -1.31492302e-02]
[ 2.74850979e-01 8.17787439e-01 -3.30529428e-01 -2.47586776e-01
7.61209475e-03 -9.47284784e-02 -2.30527630e-01 -2.88399787e-02
-3.42614961e-02 1.44400836e-01 -1.00058196e-02]
[ 1.27772687e-01 6.26879208e-02 1.56125782e-02 -2.78229777e-02
6.50384673e-02 -9.97115519e-02 8.80130871e-02 -7.09045280e-01
3.12874626e-02 -6.73092437e-01 -5.50018711e-03]
[ 7.54505767e-01 -1.48649751e-01 -1.69731021e-01 4.17530648e-02
-6.00307878e-02 4.55558919e-01 4.02189383e-01 5.61189972e-02
2.53346498e-03 4.40967138e-02 -8.84058405e-03]
[ 5.28365150e-02 1.12535903e-01 3.76073286e-02 -1.74684995e-02
-2.03451799e-03 -6.78460017e-01 7.00072931e-01 1.74584853e-01
2.77661197e-02 3.14436347e-02 -1.30667938e-02]]
Looking at row 0, I see that I need these features to get my 90% variance: 0, 1, 3, 5, 6, 7, 9, 10.
But what are the next 10 rows showing?
Can I just use the new transformed dataset (Xnew) and keep those 8 features, i.e., something like the snippet below?
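By "use those 8 features" I mean something like this (just a sketch of what I have in mind):
X_reduced = Xnew[:, :8]       # first 8 columns of the PCA-transformed data (components are ordered by explained variance)
print(X_reduced.shape)        # (n_samples, 8)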
What am I missing? :-)
Tags: pca, scikit-learn