How to interpret scikit-learn PCA components_ output
I am trying to use PCA with scikit-learn for feature selection, and there is something about PCA that I am not understanding. Can someone please fill in the blanks for me? I have a normalised dataset with 11 features. The output of PCA is:
==================================================
Explained Variance:
[0.29673715 0.15425831 0.10136684 0.09121094 0.09012841 0.08089791
0.07294822 0.04842635 0.0290573 0.0249145 0.01005407]
==================================================
Cumulative Explained Variance:
[ 29.67371513 45.09954607 55.23623054 64.35732446 73.37016542
81.45995668 88.75477905 93.5974138 96.50314332 98.9945929
100. ]
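For reference, the numbers above were printed with something along these lines (a rough sketch, not my exact script; Xnorm is my normalised data):
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()                                             # keep all 11 components for now
pca.fit(Xnorm)                                          # Xnorm: normalised data, shape (n_samples, 11)
print("Explained Variance:")
print(pca.explained_variance_ratio_)                    # fraction of variance per component
print("Cumulative Explained Variance:")
print(np.cumsum(pca.explained_variance_ratio_) * 100)   # running total, in percent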
==================================================
The graph of the cumulative variance is:
I would like to select the best features for capturing at least 90% of the variance. From the cumulative variance above, it looks like I need 8 features (93.6% is reached at the 8th component).
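If it helps, I believe scikit-learn can also pick this cut-off for me by passing a fraction to n_components (just a sketch of what I mean, not the code I actually ran):
from sklearn.decomposition import PCA

pca_90 = PCA(n_components=0.90, svd_solver="full")    # keep just enough components for >= 90% variance
X_90 = pca_90.fit_transform(Xnorm)                    # Xnorm: my normalised data
print(pca_90.n_components_)                           # 8, going by the cumulative variance above
print(X_90.shape)                                     # (n_samples, 8)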
Now - this is where I am getting stuck in my understanding... The output of:
from sklearn.decomposition import PCA

n_components = min(train_X.shape[0], train_X.shape[1])   # cap at min(samples, features)
print("Number of components: {}".format(n_components))
pca = PCA(n_components)
Xnew = pca.fit_transform(Xnorm)                           # project the normalised data onto the components
print(pca.components_)
is:
[[-4.54963949e-01 3.52251748e-01 1.10835672e-01 3.24987376e-02
1.50839952e-01 4.89800984e-01 5.07587627e-01 -2.55056877e-01
-1.51085428e-01 2.17680369e-01 4.16434152e-03]
[ 3.11238184e-02 3.16571309e-01 3.71425795e-01 1.16885130e-01
2.34244968e-01 1.94162813e-01 6.93267053e-05 5.73375241e-01
1.55866200e-02 -5.70002237e-01 -2.68345326e-02]
[ 8.93789369e-02 5.78086627e-02 -1.96617492e-01 7.88298969e-01
5.18661364e-01 -1.27086158e-01 -9.17946249e-02 -9.49203239e-02
5.79563898e-02 1.44485999e-01 4.20010936e-02]
[-1.58722853e-01 -2.27506222e-02 -3.00837400e-01 -1.15388113e-02
1.43405517e-02 7.48094459e-02 6.02387058e-02 6.00762155e-02
4.25343554e-01 -7.73269870e-02 -8.26871586e-01]
[-2.09948047e-01 -8.48143267e-03 -4.01247348e-01 -8.14759788e-02
9.18873965e-04 1.13669361e-01 1.09267168e-01 1.16682505e-01
6.44328962e-01 -1.46582716e-01 5.59544565e-01]
[ 1.60325194e-01 2.15310361e-01 5.76206458e-01 2.45075331e-01
-3.75232499e-01 -1.69211786e-03 -6.07564983e-02 -1.84233535e-01
5.51880866e-01 2.29594827e-01 -1.18138407e-02]
[ 1.58542278e-01 -1.33956129e-01 2.96423766e-01 -4.82608786e-01
7.10257199e-01 -3.10976685e-02 -4.38346290e-02 -9.38340797e-02
2.64559513e-01 2.23200607e-01 -1.31492302e-02]
[ 2.74850979e-01 8.17787439e-01 -3.30529428e-01 -2.47586776e-01
7.61209475e-03 -9.47284784e-02 -2.30527630e-01 -2.88399787e-02
-3.42614961e-02 1.44400836e-01 -1.00058196e-02]
[ 1.27772687e-01 6.26879208e-02 1.56125782e-02 -2.78229777e-02
6.50384673e-02 -9.97115519e-02 8.80130871e-02 -7.09045280e-01
3.12874626e-02 -6.73092437e-01 -5.50018711e-03]
[ 7.54505767e-01 -1.48649751e-01 -1.69731021e-01 4.17530648e-02
-6.00307878e-02 4.55558919e-01 4.02189383e-01 5.61189972e-02
2.53346498e-03 4.40967138e-02 -8.84058405e-03]
[ 5.28365150e-02 1.12535903e-01 3.76073286e-02 -1.74684995e-02
-2.03451799e-03 -6.78460017e-01 7.00072931e-01 1.74584853e-01
2.77661197e-02 3.14436347e-02 -1.30667938e-02]]
Looking at row 0, I see that I need these features to get my 90% variance: 0, 1, 3, 5, 6, 7, 9, 10.
But what are the next 10 rows showing?
Can I just use the new transformed dataset (Xnew) and keep those 8 features, i.e., something like the snippet below?
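By "use those 8 features" I mean something like this (just a sketch of what I have in mind):
X_reduced = Xnew[:, :8]       # first 8 columns of the PCA-transformed data (components are ordered by explained variance)
print(X_reduced.shape)        # (n_samples, 8)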
What am I missing? :-)
Tags: pca, scikit-learn