I struggle with figuring out how to take PCA into production so I can test my models with unknown samples. I'm using both a one-hot encoding and a TF-IDF encoding in order to classify my elements with various models, mainly KNN. I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode the new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
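A minimal sketch of how such a fitted encoding + reduction + KNN stack could be kept together and reapplied to unseen samples, assuming placeholder column names "text" and "category" and toy data; TruncatedSVD is used here as the PCA-style step because the combined one-hot/TF-IDF matrix is sparse:

    import joblib
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    # Toy training data with placeholder column names
    X_train = pd.DataFrame({
        "text": ["red apple", "green apple", "ripe banana",
                 "yellow banana", "sweet orange", "sour orange"],
        "category": ["fruit", "fruit", "fruit", "fruit", "citrus", "citrus"],
    })
    y_train = ["apple", "apple", "banana", "banana", "orange", "orange"]

    encode = ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "text"),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ])

    pipe = Pipeline([
        ("encode", encode),
        ("reduce", TruncatedSVD(n_components=5)),     # PCA-style reduction that accepts sparse input
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])
    pipe.fit(X_train, y_train)

    # Persist the fitted pipeline; in production, load it and apply it unchanged to new samples
    joblib.dump(pipe, "encode_reduce_knn.joblib")
    model = joblib.load("encode_reduce_knn.joblib")
    print(model.predict(pd.DataFrame({"text": ["green banana"], "category": ["fruit"]})))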
I am learning PCA and my question is the following: can PCA be applied to a dataset containing both numerical continuous and numerical discrete variables? Thank you
In my dataset, a data point is essentially a time series of 6 features per month over a year, so in all it results in 6*12 = 72 features. I need to find class outliers, so I perform dimensionality reduction, hoping the differences in the data are preserved, and then apply k-means clustering and compute distances. For dimensionality reduction I have tried PCA and a simple autoencoder to reduce the dimension from 72 to 6, but the results are unsatisfactory. Can anyone please suggest any other …
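A rough sketch of that reduce-then-cluster-then-distance pipeline, with random placeholder data standing in for the 72 monthly features; the number of clusters and the 95th-percentile cut-off are illustrative choices only:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # X is assumed to be an (n_samples, 72) array: 6 features x 12 months per sample
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 72))  # placeholder data

    X_scaled = StandardScaler().fit_transform(X)
    X_reduced = PCA(n_components=6).fit_transform(X_scaled)

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_reduced)

    # Distance of each sample to its assigned cluster centre; large values flag candidate outliers
    dist_to_centre = np.linalg.norm(X_reduced - kmeans.cluster_centers_[kmeans.labels_], axis=1)
    threshold = np.percentile(dist_to_centre, 95)   # arbitrary cut-off for illustration
    outliers = np.where(dist_to_centre > threshold)[0]
    print(outliers)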
I have a dataset that has high collinearity among variables. When I created the linear regression model, I could not include more than five variables (I eliminated a feature whenever its VIF > 5). But I need to have all the variables in the model and find their relative importance. Is there any way around this? I was thinking about doing PCA and building models on the principal components. Does that help?
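One way this is often done is principal component regression: fit the regression on the components, then map the coefficients back to the original (standardized) variables. A minimal sketch with synthetic collinear data; the choice of 5 components is arbitrary here:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    # Synthetic data with strongly collinear columns
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(200, 5))
    y = X @ rng.normal(size=10) + rng.normal(size=200)

    pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
    pcr.fit(X, y)

    # Map the coefficients on the components back to the original (standardized) variables
    pca = pcr.named_steps["pca"]
    lr = pcr.named_steps["linearregression"]
    coef_original_scale = pca.components_.T @ lr.coef_
    print(coef_original_scale)   # one weight per original variable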
I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of the variables and calculated the percentages. Then I used the Orange software to run PCA. I uploaded the file and selected the columns and rows. And when it comes to PCA, the program asks me if I want to normalize …
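If it helps to see what that option changes, here is a small Python sketch, assuming Orange's "normalize" roughly corresponds to scaling each variable to unit variance, and using random placeholder percentages for the 15 books:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Placeholder: 15 books x 6 percentage variables
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(15, 6))

    # Without normalization, variables with larger variance dominate the components
    pca_raw = PCA(n_components=2).fit(X)
    print("raw variance ratio:", pca_raw.explained_variance_ratio_)

    # With normalization (each variable scaled to unit variance) all variables weigh equally
    X_std = StandardScaler().fit_transform(X)
    pca_std = PCA(n_components=2).fit(X_std)
    print("standardized variance ratio:", pca_std.explained_variance_ratio_)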
How can I change the legend? As you can see, the legend has some cluster numbers missing. How can I adjust the legend so that it shows all the cluster numbers (such as Cluster 1, Cluster 2, etc.; right now it only shows 0, 3, 6, 9)? (Code: I followed this link: Perform k-means clustering over multiple columns)

    kmeans = KMeans(n_clusters=10)
    y2 = kmeans.fit_predict(scaled_data)
    reduced_scaled_data = PCA(n_components=2).fit_transform(scaled_data)
    results = pd.DataFrame(reduced_scaled_data, columns=['pca1', 'pca2'])
    sns.scatterplot(x="pca1", y="pca2", hue=y2, data=results)  # y2 is my cluster number
    plt.title('K-means …
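The legend is abbreviated because seaborn treats a numeric hue as a continuous variable and shows only a few tick values. A self-contained sketch of one way to get every cluster listed, by passing string labels and legend="full" (placeholder data stands in for scaled_data):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    scaled_data = rng.normal(size=(300, 8))   # placeholder for your scaled data

    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
    y2 = kmeans.fit_predict(scaled_data)

    reduced = PCA(n_components=2).fit_transform(scaled_data)
    results = pd.DataFrame(reduced, columns=["pca1", "pca2"])

    # Convert the labels to strings so seaborn treats them as categories,
    # and request a full legend so every cluster appears.
    results["cluster"] = ["Cluster %d" % c for c in y2]
    sns.scatterplot(x="pca1", y="pca2", hue="cluster", data=results,
                    hue_order=sorted(results["cluster"].unique()), legend="full")
    plt.title("K-means clusters in PCA space")
    plt.show()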
I'm trying to determine whether it's best to use linear or quadratic discriminant analysis for a problem I'm working on. It's my understanding that one of the motivations for using QDA over LDA is that it deals better with circumstances in which the variance of the predictors is not constant across the classes being predicted. This is true for my data; however, I intend to carry out principal components analysis beforehand. Because this PCA will involve scaling/normalising the variables, …
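For what it's worth, the two options can be compared directly by putting the scaling and PCA inside a pipeline and cross-validating both classifiers; a small sketch using the wine dataset as a stand-in and an arbitrary 5 components:

    from sklearn.datasets import load_wine
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.model_selection import cross_val_score

    X, y = load_wine(return_X_y=True)   # stand-in dataset

    lda_pipe = make_pipeline(StandardScaler(), PCA(n_components=5), LinearDiscriminantAnalysis())
    qda_pipe = make_pipeline(StandardScaler(), PCA(n_components=5), QuadraticDiscriminantAnalysis())

    # PCA scales and rotates the pooled data, but the per-class covariances of the
    # component scores can still differ, so LDA and QDA can still behave differently.
    print("LDA:", cross_val_score(lda_pipe, X, y, cv=5).mean())
    print("QDA:", cross_val_score(qda_pipe, X, y, cv=5).mean())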
I am looking for a public dataset of images that differ from each other only slightly, so that after applying PCA they can be reconstructed with a small error from very few PCA coefficients. It can be any type of images; the purpose is only to demonstrate an extreme example of PCA.
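As a placeholder until a better image set turns up, the digits data bundled with scikit-learn already shows the effect: the images are small and similar enough that a handful of components reconstructs them reasonably well. A sketch (the choice of 5 components is arbitrary):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # load_digits is only a stand-in; any set of visually similar images works the same way
    X = load_digits().data              # (1797, 64) flattened 8x8 images
    pca = PCA(n_components=5).fit(X)    # keep only a handful of components

    X_compressed = pca.transform(X)             # 5 coefficients per image
    X_reconstructed = pca.inverse_transform(X_compressed)

    err = np.mean((X - X_reconstructed) ** 2)
    print("kept variance:", pca.explained_variance_ratio_.sum())
    print("mean squared reconstruction error:", err)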
I'm performing PCA on different time series and then using k-means clustering to try to group together common factors. The issue I'm facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
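One possible (not necessarily best) way to handle the inactive periods before PCA is to make every factor a full-length vector first, for example by filling the gaps; a sketch with placeholder data, where what to fill with (0, the mean, interpolation) depends on what "inactive" means for the data:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # ts is assumed to be a DataFrame of shape (n_factors, n_months), with NaN where
    # a factor is inactive; placeholder data here (40 factors x 144 months = 12 years)
    rng = np.random.default_rng(0)
    ts = pd.DataFrame(rng.normal(size=(40, 144)))
    ts.iloc[:10, :24] = np.nan          # some factors inactive for the first two years

    # Fill inactive periods so every factor has a full-length vector before PCA
    filled = ts.fillna(0.0)
    scores = PCA(n_components=5).fit_transform(filled.values)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
    print(labels)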
I'm new to Stack Exchange, so I am sorry if this is not the right way to ask a question. For my thesis I wish to propose a method for future research that uses PCA to cluster features (feature clustering) and then applies per-cluster PCA. I got the idea from this paper. But I have a hard time finding literature about PCA being used to cluster variables (rather than reduce variables). I could imagine that it is not …
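One common recipe for PCA-based feature clustering (not necessarily the one in the cited paper) is to describe each feature by its loadings on the leading components, cluster those loading vectors, and then run PCA within each feature cluster; a sketch with placeholder data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))            # placeholder data: 20 features
    X_std = StandardScaler().fit_transform(X)

    # Step 1: describe each feature by its loadings on the leading components
    global_pca = PCA(n_components=5).fit(X_std)
    loadings = global_pca.components_.T        # shape (n_features, 5): one row per feature

    # Step 2: cluster the features by the similarity of their loading patterns
    feature_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(loadings)

    # Step 3: run a separate PCA inside each feature cluster
    for k in np.unique(feature_labels):
        cols = np.where(feature_labels == k)[0]
        sub_pca = PCA(n_components=min(2, len(cols))).fit(X_std[:, cols])
        print(f"cluster {k}: features {cols.tolist()}, "
              f"variance kept {sub_pca.explained_variance_ratio_.sum():.2f}")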
I am trying to use PCA with scikit-learn for feature selection, and there is something about PCA that I am not understanding. Can someone please fill in the blanks for me? I have a normalised dataset with 11 components. The output of PCA is:

    ==================================================
    Explained Variance:
    [0.29673715 0.15425831 0.10136684 0.09121094 0.09012841 0.08089791 0.07294822 0.04842635 0.0290573 0.0249145 0.01005407]
    ==================================================
    Cumulative Explained Variance:
    [ 29.67371513 45.09954607 55.23623054 64.35732446 73.37016542 81.45995668 88.75477905 93.5974138 96.50314332 98.9945929 100. ]
    ==================================================

The graph of the …
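For reference, those two arrays are just explained_variance_ratio_ and its cumulative sum; a small sketch with placeholder data showing how they are produced and how a cut-off could be read from them:

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder for the normalised dataset with 11 columns
    rng = np.random.default_rng(0)
    X_norm = rng.normal(size=(500, 11))

    pca = PCA().fit(X_norm)

    explained = pca.explained_variance_ratio_      # fraction of variance per component
    cumulative = np.cumsum(explained) * 100        # running total in percent

    print("Explained Variance:", explained)
    print("Cumulative Explained Variance:", cumulative)

    # e.g. keep enough components to cover ~90% of the variance
    n_keep = np.searchsorted(cumulative, 90) + 1
    print("components needed for 90%:", n_keep)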
I'm new to statistics, so sorry for any major lack of knowledge in the topic; I'm just doing a project for graduation. I'm trying to cluster a health dataset containing diseases (3,456) and symptoms (25), grouping them by the number of events that occurred. My concern is that a lot of the values are 0 simply because some diseases didn't show that particular symptom, for example (I made up the values for now): So, I was wondering what was the best way to cluster this …
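One possible approach (just a sketch, not necessarily the best way) is to keep the disease-by-symptom counts as a sparse matrix, row-normalise them into symptom profiles so frequent diseases do not dominate, reduce with TruncatedSVD, and only then cluster; placeholder counts below:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import normalize
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    # Placeholder: diseases x symptoms matrix of event counts, mostly zeros
    rng = np.random.default_rng(0)
    counts = rng.poisson(0.2, size=(3456, 25))
    X = csr_matrix(counts)

    # Row-normalising turns raw counts into symptom profiles;
    # TruncatedSVD works directly on the sparse matrix.
    profiles = normalize(X, norm="l1")
    reduced = TruncatedSVD(n_components=10, random_state=0).fit_transform(profiles)
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)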
So I am performing k-means clustering on RFM variables (Recency, Frequency, Monetary). The RFM variables are in the form of quantiles (1-4). I used PCA and found the principal components. I then used the elbow method to find the optimal number of clusters, which I use in the k-means algorithm. Could anyone guide me on whether this is a correct method? Further, on the graph the clusters' axes range from -3 to 3, and I …
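A sketch of that sequence with placeholder quantile scores; the -3 to 3 range on the axes most likely just reflects that PCA centres the data (and, if the scores are standardized first, most values fall within about three standard deviations of zero):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Placeholder: (n_customers, 3) array of quantile scores 1-4
    rng = np.random.default_rng(0)
    rfm = rng.integers(1, 5, size=(1000, 3))

    rfm_std = StandardScaler().fit_transform(rfm)         # common, but optional, scaling step
    components = PCA(n_components=2).fit_transform(rfm_std)

    # Elbow method: plot inertia against k and look for the bend
    inertias = []
    ks = range(1, 11)
    for k in ks:
        inertias.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(components).inertia_)
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)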
Let's imagine that I have 3 points and they all lie on a sloped straight line, such as (-4, -2), (0, 0), (2, 1); this is a straight line passing through the origin. Intuitively, PCA2 would be 0, as I have no spread in the data perpendicular to the line, and PCA1 would be the maximum variance of the data along the line. Is my intuition correct that PCA2 = 0? Is there any intuitive way to calculate PCA1 for this scenario?
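The intuition can be checked numerically; a quick sketch with these three points:

    import numpy as np
    from sklearn.decomposition import PCA

    points = np.array([[-4, -2], [0, 0], [2, 1]])   # all on the line y = x/2

    pca = PCA(n_components=2).fit(points)
    print(pca.explained_variance_)         # second value is ~0: no spread off the line
    print(pca.explained_variance_ratio_)   # ~[1., 0.]: PCA1 carries all the variance

    # PCA1's variance is just the variance of the points projected onto the line's direction
    direction = np.array([2, 1]) / np.sqrt(5)
    projections = (points - points.mean(axis=0)) @ direction
    print(projections.var(ddof=1))         # matches pca.explained_variance_[0]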
I have been using PCA dimensionality reduction on datasets that are quite linear and now I am tasked with the same on datasets that are largely curved in space. Imagine a noisy sine wave for simplicity. Is PCA still useful in this scenario? If not, what is a more appropriate dimensionality reduction method?
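PCA still only finds a straight-line (linear) projection, so for data lying on a curve people often reach for nonlinear methods such as Kernel PCA or Isomap; a small comparison sketch on a noisy sine wave (the kernel and neighbour settings are arbitrary):

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.manifold import Isomap

    # Noisy sine wave: intrinsically 1-D, but curved in 2-D space
    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 500)
    X = np.column_stack([t, np.sin(t)]) + 0.05 * rng.normal(size=(500, 2))

    linear = PCA(n_components=1).fit_transform(X)                        # straight-line projection
    kpca = KernelPCA(n_components=1, kernel="rbf", gamma=0.5).fit_transform(X)
    iso = Isomap(n_components=1, n_neighbors=10).fit_transform(X)
    # PCA can only project onto a straight line; Kernel PCA and Isomap can follow the curve.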
I have my data in a Pandas df with 25,000 rows and 1,500 columns without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are columns with binary values that originated from columns with categorical data (I used pd.get_dummies() for this). Now I'd like to reduce the dimensions. I have already been running

    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(df)

for three hours, and I asked myself whether my approach was correct. I also …
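If the bottleneck is the PCA itself, two things that may help are the randomized SVD solver and IncrementalPCA, which processes the data in batches; a sketch with a smaller placeholder array (recent scikit-learn versions may already pick the randomized solver automatically for this shape):

    import numpy as np
    from sklearn.decomposition import PCA, IncrementalPCA

    # Smaller placeholder for df.values, just for illustration
    rng = np.random.default_rng(0)
    values = rng.normal(size=(2_500, 150))

    # Randomized solver: fast when only a few components are requested
    pca = PCA(n_components=2, svd_solver="randomized", random_state=0)
    scores = pca.fit_transform(values)

    # IncrementalPCA processes the data in batches if memory is the bottleneck
    ipca = IncrementalPCA(n_components=2, batch_size=1_000)
    scores_batched = ipca.fit_transform(values)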
I have the following problem: I have some sort of data (that I can't publish here, but they are in the form of points with XYZ coordinates) and I can represent them as a collection of graphs, i.e. $Q = \{G_1, G_2, \ldots, G_t\}$, where for every node there is an associated set of features, e.g. node $u_i$ has feature vector $\mathcal{F}_i$, and the features change between graphs (but the graph structure does not). The resulting graphs are big in …
I have TV watching data and I have been trying to cluster it to get different sets of watchers. My dataset consists of 64 features (such as total watching time, percent of ads skipped, movies vs. shows, etc.). All the variables are either numerical or binary. But no matter how I treat them (normalize them, standardize them, leave them as is, take a subset of features, etc.), I always end up getting pictures similar to this: This particular picture was constructed …
I was trying to use the t-SNE algorithm for dimensionality reduction, and I know this is not the primary usage of the algorithm and is not recommended. I saw an implementation here. I am not convinced by this implementation of t-SNE. The algorithm works like this: given a training dataset and a test dataset, 1) combine the two into one full dataset, 2) run t-SNE on the full dataset (excluding the target variable), 3) take the output of the t-SNE and add it as …
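The described procedure, written out as a sketch with placeholder DataFrames (note the caveat in the comments: t-SNE has no transform() for unseen points, which is presumably why the implementation embeds train and test together):

    import numpy as np
    import pandas as pd
    from sklearn.manifold import TSNE

    # Placeholder train/test data; train carries a "target" column
    rng = np.random.default_rng(0)
    train = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
    train["target"] = rng.integers(0, 2, size=200)
    test = pd.DataFrame(rng.normal(size=(50, 5)), columns=[f"f{i}" for i in range(5)])

    # 1) combine train and test into one full dataset (features only)
    full = pd.concat([train.drop(columns="target"), test], ignore_index=True)

    # 2) run t-SNE on the full dataset
    embedding = TSNE(n_components=2, random_state=0).fit_transform(full.values)

    # 3) add the t-SNE output back as extra features, split by the original row order
    train["tsne1"], train["tsne2"] = embedding[:len(train), 0], embedding[:len(train), 1]
    test["tsne1"], test["tsne2"] = embedding[len(train):, 0], embedding[len(train):, 1]
    # Caveat: t-SNE cannot embed new points into an existing map, so the whole
    # embedding must be recomputed whenever new data arrives.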