Using PCA as features for production

I'm struggling to figure out how to take PCA into production in order to test my models with unknown samples. I'm using both a One-Hot-Encoding and a TF-IDF encoding in order to classify my elements with various models, mainly KNN. I know I can use the pretrained One-Hot-Encoder and the TF-IDF encoder to encode the new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
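A minimal sketch of the usual production pattern, under the assumption that all variable names (train_texts, train_cats, new_texts, new_cats) are hypothetical: fit every transformer on the training data once, persist the fitted objects, and only call transform (never fit) on incoming samples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from scipy.sparse import hstack
import joblib

# Fit the encoders on the training data only (hypothetical variable names).
tfidf = TfidfVectorizer().fit(train_texts)
ohe = OneHotEncoder(handle_unknown='ignore').fit(train_cats)
X_train = hstack([tfidf.transform(train_texts), ohe.transform(train_cats)])

# PCA needs dense input; for large sparse matrices TruncatedSVD is the usual substitute.
pca = PCA(n_components=100).fit(X_train.toarray())

# Persist everything so production applies the exact same transformations.
joblib.dump({'tfidf': tfidf, 'ohe': ohe, 'pca': pca}, 'encoders.joblib')

# At inference time: load and transform only, never refit.
state = joblib.load('encoders.joblib')
X_new = hstack([state['tfidf'].transform(new_texts), state['ohe'].transform(new_cats)])
X_new_reduced = state['pca'].transform(X_new.toarray())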
Category: Data Science

Which algorithm can be used to reduce dimension of multiple time series?

In my dataset, a data point is essentially a time series of 6 features per month over a year, so in all it results in 6*12 = 72 features. I need to find class outliers, so I perform dimensionality reduction, hoping the differences in the data are maintained, and then apply k-means clustering and compute distances. For dimensionality reduction I have tried PCA and a simple autoencoder to reduce the dimension from 72 to 6, but the results are unsatisfactory. Can anyone please suggest any other …
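Not an answer from the source, but one hedged nonlinear alternative worth trying is kernel PCA, which can preserve structure that plain PCA and a shallow autoencoder miss. X is assumed to be a hypothetical (n_samples, 72) matrix of the flattened monthly series.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

# X: hypothetical (n_samples, 72) array of flattened monthly series.
X_scaled = StandardScaler().fit_transform(X)

# An RBF kernel can capture nonlinear structure; gamma usually needs tuning.
kpca = KernelPCA(n_components=6, kernel='rbf', gamma=0.1)
X_reduced = kpca.fit_transform(X_scaled)

labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_reduced)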
Category: Data Science

Does PCA help to include all the variables even if there is high collinearity among variables?

I have a dataset that has high collinearity among variables. When I created the linear regression model, I could not include more than five variables (I eliminated a feature whenever VIF > 5). But I need to have all the variables in the model and find their relative importance. Is there any way around it? I was thinking about doing PCA and creating models on the principal components. Does it help?
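For concreteness, a sketch of that idea, often called principal component regression: regress on the components, then rotate the coefficients back into the original variable space to read off relative importance. X, y and feature_names are hypothetical.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X_scaled = StandardScaler().fit_transform(X)  # X: hypothetical design matrix
pca = PCA().fit(X_scaled)
Z = pca.transform(X_scaled)

reg = LinearRegression().fit(Z, y)

# transform() computes Z = X @ components_.T, so y = X @ components_.T @ coef_,
# i.e. the coefficients on the original variables are components_.T @ coef_.
beta_original = pca.components_.T @ reg.coef_
print(dict(zip(feature_names, beta_original)))  # feature_names: hypothetical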
Category: Data Science

PCA and orange software

I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of the variables and calculated the percentages. Then I used Orange software to run PCA. I uploaded the file and selected the columns and rows. And when it comes to PCA, the program asks me if I want to normalize …
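The normalize option matters whenever the variables are on different scales, since unscaled PCA lets high-variance columns dominate. A quick sketch of the difference (in Python rather than Orange, with a hypothetical X holding the 15 x 6 percentage matrix):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: hypothetical (15 books x 6 variables) matrix of percentages.
raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Without scaling, high-variance columns dominate the components.
print(raw.explained_variance_ratio_)
print(scaled.explained_variance_ratio_)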
Category: Data Science

How can I adjust the legend when visualizing clusters in two dimensions?

How can I change the legend? As we can see, the legend has some cluster numbers missing. How can I adjust it so that it shows all the cluster numbers (such as Cluster 1, Cluster 2, etc.; right now it's only 0, 3, 6, 9)? (For the code, I followed this link: Perform k-means clustering over multiple columns.)

kmeans = KMeans(n_clusters=10)
y2 = kmeans.fit_predict(scaled_data)
reduced_scaled_data = PCA(n_components=2).fit_transform(scaled_data)
results = pd.DataFrame(reduced_scaled_data, columns=['pca1', 'pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=y2, data=results)  # y2 is my cluster number
plt.title('K-means …
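seaborn's scatterplot defaults to legend='brief', which samples only a few levels when the hue looks numeric; asking for the full legend, and turning the labels into strings so they are treated as categories, should show every cluster. A sketch against the code above:

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(
    x="pca1", y="pca2",
    hue=[f"Cluster {c}" for c in y2],  # string labels force a categorical legend
    legend="full",                     # show every level, not a numeric sample
    data=results,
)
plt.title("K-means clusters in PCA space")
plt.show()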
Category: Data Science

Python sklearn PCA transform function output does not match

I am computing PCA on some data using 10 components and using 3 out of 10 as:

transformer = PCA(n_components=10)
trained = transformer.fit(train)
one = numpy.matmul(train, numpy.transpose(trained.components_[:3, :]))

Here trained.components_[:3, :] is:

array([[-1.43311999e-03,  1.65635865e-01,  5.49189565e-01,  5.26069645e-02,
         2.42638594e-01,  1.20957807e-02,  1.30595572e-01,  1.09279646e-02,
         7.21299808e-03, -2.79057934e-02, -1.14834589e-02,  5.06289160e-01,
         5.42890317e-01,  8.50422194e-02,  1.80935205e-01,  2.98473275e-05,
        -8.04537378e-04],
       [-1.05419313e-02,  3.09442577e-01, -8.15534934e-02,  4.28621520e-03,
         2.93323569e-01,  3.85849115e-02, -1.16193185e-01,  4.14964652e-01,
         4.16279154e-01,  2.95264788e-01,  3.28620106e-01, -2.60916490e-01,
        -2.37459426e-02,  1.57567265e-01,  4.02873342e-01,  5.28389303e-05,
        -2.07920000e-03],
       [ 8.63072772e-03, -3.26129082e-01,  8.59869400e-02,  3.04770780e-03,
        -3.14966419e-01, -2.47151330e-02,  1.05987767e-01,  3.74235953e-01,
         3.75747065e-01,  2.76035253e-01,  3.18273743e-01,  3.02423861e-01,
         2.76535177e-02, -1.51485057e-01, -4.48558170e-01, -8.83328996e-05,
        -2.25542180e-03]])

and using only …
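The usual reason the manual product doesn't match is centering: PCA.transform subtracts the training mean before projecting, so the bare matrix product won't reproduce it. A sketch of the check, assuming train is the array the PCA was fitted on:

import numpy as np

# transform(X) computes (X - mean_) @ components_.T, not X @ components_.T.
manual = (train - trained.mean_) @ trained.components_[:3, :].T
built_in = trained.transform(train)[:, :3]
print(np.allclose(manual, built_in))  # True once the mean is subtracted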
Category: Data Science

Whether to use LDA or QDA

I'm trying to determine whether it's best to use linear or quadratic discriminant analysis for an analysis that I'm working on. It's my understanding that one of the motivations for using QDA over LDA is that it deals better with circumstances in which the variance of the predictors is not constant across the classes being predicted. This is true for my data; however, I intend to carry out principal components analysis beforehand. Because this PCA will involve scaling/normalising the variables, …
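Worth noting that PCA scales variables globally; it does not equalize the per-class covariances, which is exactly what separates LDA from QDA. One pragmatic way to settle it (a sketch with hypothetical X and y): cross-validate both discriminants on the PCA-transformed features and let the data decide.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    # n_components=0.95 keeps enough components for 95% of the variance.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), clf)
    print(type(clf).__name__, cross_val_score(pipe, X, y, cv=5).mean())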
Category: Data Science

Filling huge parts of missing data for PCA

I'm performing PCA on different time series and then using k-means clustering to try to group together common factors. The issue I'm facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
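Not from the question, but a common starting point is to impute the gaps before PCA, for example with scikit-learn's IterativeImputer, which models each factor's missing stretches from the other factors. A sketch, where df is a hypothetical frame with NaN gaps:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

# Estimate each factor's missing stretches from the other factors.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(df)  # df: hypothetical frame with NaN gaps

X_reduced = PCA(n_components=2).fit_transform(X_filled)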
Category: Data Science

Hard time finding literature on feature clustering using Principal Component Analysis

I'm new to StackExchange, so I am sorry if this is not the right way to ask a question on StackExchange. For my thesis I wish to propose a method for future research on using PCA to cluster features (feature clustering) and then apply per-cluster PCA. I got the idea from this paper. But I have a hard time finding literature about PCA being used to cluster variables (not reduce variables). I could imagine that it is not …
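For concreteness, a hedged sketch of the idea being described: treat each feature as a point in component space via its PCA loadings, cluster those points, then run a separate PCA per feature cluster. X is hypothetical.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pca = PCA(n_components=5).fit(X)  # X: hypothetical (n_samples, n_features)
loadings = pca.components_.T      # one row of loadings per feature

feature_labels = KMeans(n_clusters=3, n_init=10).fit_predict(loadings)

# Per-cluster PCA on the columns belonging to each feature cluster.
reduced_blocks = []
for k in range(3):
    cols = np.where(feature_labels == k)[0]
    reduced_blocks.append(PCA(n_components=1).fit_transform(X[:, cols]))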
Category: Data Science

How to interpret scikit-learn PCA components output

I am trying to use PCA with scikit-learn for feature selection and there is something about PCA that I am not understanding. Can someone please fill in the blanks for me? I have a normalised dataset with 11 components. The output of PCA is:

Explained Variance:
[0.29673715 0.15425831 0.10136684 0.09121094 0.09012841 0.08089791
 0.07294822 0.04842635 0.0290573  0.0249145  0.01005407]

Cumulative Explained Variance:
[ 29.67371513  45.09954607  55.23623054  64.35732446  73.37016542
  81.45995668  88.75477905  93.5974138   96.50314332  98.9945929
 100.        ]

The graph of the …
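Reading that output: each entry of explained_variance_ratio_ is the fraction of total variance a component carries, and the cumulative line shows, for instance, that the first 5 components already cover about 73%. (Also note PCA builds new components rather than selecting original features, so it is feature extraction, not feature selection.) A sketch of using the cumulative figures to pick the dimensionality, with a hypothetical X_normalised:

from sklearn.decomposition import PCA

# A float n_components asks PCA for the smallest number of components
# reaching that cumulative explained variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_normalised)  # X_normalised: hypothetical
print(pca.n_components_)  # 9 here, per the cumulative output above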
Category: Data Science

What is the best/correct algorithm/procedure to cluster a dataset with a lot of 0's?

I'm new to statistics, so sorry for any major lack of knowledge in the topic; I'm just doing a project for graduation. I'm trying to cluster a health dataset containing Diseases (3456) and Symptoms (25), grouping them by the number of events that occurred. My concern is that a lot of the values are 0, simply because some diseases didn't show that particular symptom, for example (I made up the values for now): So, I was wondering what was the best way to cluster this …
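One hedged option for sparse count data like this: make the clustering depend on the pattern of symptoms rather than the many shared zeros by L2-normalising the rows, which makes Euclidean k-means behave like cosine similarity. counts is a hypothetical matrix.

from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# counts: hypothetical (3456 diseases x 25 symptoms) matrix of event counts.
# On L2-normalised rows, squared Euclidean distance is 2 - 2*cosine,
# so k-means effectively clusters by symptom profile, not magnitude.
X_norm = normalize(counts)
labels = KMeans(n_clusters=8, n_init=10).fit_predict(X_norm)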
Category: Data Science

Using PCA to cluster multidimensional data (RFM variables)

So I am performing k-means clustering on RFM variables (Recency, Frequency, Monetary). The RFM variables are in the form of quantiles (1-4). I used PCA and found the PCA components. I then used the elbow method to find the optimal number of clusters, and then I use it in the k-means algorithm. Could anyone guide me on whether this is a correct method? Further, for the clusters I get, the graph's axes range from -3 to 3, and I …
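For cross-checking the procedure, a sketch of the elbow computation on the PCA-reduced RFM quantiles (rfm is hypothetical). The -3 to 3 axes are simply the PCA component scores of the quantile data, not the original 1-4 scale.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X_pca = PCA(n_components=2).fit_transform(rfm)  # rfm: hypothetical quantile matrix

inertias = [KMeans(n_clusters=k, n_init=10).fit(X_pca).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('k'); plt.ylabel('inertia')  # look for the bend ("elbow")
plt.show()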
Category: Data Science

Eigenvectors of points on a straight line PCA1 and PCA2

Let's imagine that I have 3 points and they are all on a sloped straight line, such as (-4, -2), (0, 0), (2, 1); this is a straight line passing through the origin. Intuitively, PCA2 would be 0, as I have no spread in the data perpendicular to the line, and PCA1 would be the maximum variance of the data along the line. Is my intuition correct that PCA2 = 0? Is there any intuitive way to calculate PCA1 for this scenario?
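The intuition is right: with all points on one line, the second eigenvalue is exactly zero, and PC1's variance is just the variance of the points' positions along the line. A quick numerical check:

import numpy as np
from sklearn.decomposition import PCA

pts = np.array([[-4, -2], [0, 0], [2, 1]])
pca = PCA(n_components=2).fit(pts)
print(pca.explained_variance_)  # second value is ~0
print(pca.components_[0])       # PC1 lies along the line, ~(2, 1)/sqrt(5)

# PC1 variance by hand: project the centered points onto the unit direction.
d = np.array([2, 1]) / np.sqrt(5)
proj = (pts - pts.mean(axis=0)) @ d
print(proj.var(ddof=1))         # 35/3 ~ 11.67, matches explained_variance_[0]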
Topic: pca
Category: Data Science

Dimensionality Reduction of Curved Structural Data

I have been using PCA dimensionality reduction on datasets that are quite linear and now I am tasked with the same on datasets that are largely curved in space. Imagine a noisy sine wave for simplicity. Is PCA still useful in this scenario? If not, what is a more appropriate dimensionality reduction method?
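PCA will still find the directions of greatest variance, but for a curved manifold like a noisy sine wave a nonlinear method usually recovers the structure better. A hedged sketch on synthetic data, comparing PCA with Isomap (one of several manifold-learning options):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# Synthetic noisy sine wave embedded in 2-D.
t = np.linspace(0, 4 * np.pi, 500)
X = np.column_stack([t, np.sin(t)]) + np.random.normal(scale=0.1, size=(500, 2))

linear = PCA(n_components=1).fit_transform(X)  # projects onto a straight axis
nonlin = Isomap(n_components=1, n_neighbors=10).fit_transform(X)  # unrolls the curve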
Category: Data Science

Guidance needed with dimension reduction for clustering - some numerical, lots of categorical data

I have my data in a Pandas df with 25,000 rows and 1,500 columns without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are columns with binary values, which originated from columns with categorical data (I used pd.get_dummies() for this). Now I'd like to reduce the dimensions. I've already been running

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df)

for three hours, and I asked myself whether my approach was correct. I also …
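For what it's worth, a sketch of a faster route under these assumptions: keep the mostly-binary matrix sparse and use TruncatedSVD, PCA's sparse-friendly cousin, which on 25,000 x 1,500 should take seconds rather than hours.

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# df: hypothetical frame, ~30 scaled numeric cols + ~1,470 dummy cols.
X = sparse.csr_matrix(df.to_numpy())

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(svd.explained_variance_ratio_.sum())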
Category: Data Science

How to tell how much information I lose when I simplify the graph data structure with respect to unsimplified graph?

I have the following problem: I have some sort of data (that I can't publish here, but they are in the form of points with XYZ coordinates) and I can represent them as a collection of graphs, i.e. $Q = \{G_1, G_2, \ldots, G_t\}$, where for every node there is an associated set of features, e.g. node $u_i$ has feature vector $\mathcal{F}_i$, and the features change between graphs (but the graph structure does not). The resulting graphs are big in …
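One generic, hedged way to put a number on the loss (not from the question): compare how well a low-rank PCA reconstructs the node-feature matrix before and after simplification, via relative reconstruction error at the same rank. X_full and X_simplified are hypothetical (n_nodes, n_features) matrices from one graph $G_t$.

import numpy as np
from sklearn.decomposition import PCA

def relative_reconstruction_error(X, k):
    # Project onto k components and measure how much of X survives.
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# A larger gap between the two errors at the same k suggests the
# simplification discarded more of the feature structure.
print(relative_reconstruction_error(X_full, k=10))
print(relative_reconstruction_error(X_simplified, k=10))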
Topic: pca graphs
Category: Data Science

Is it always possible to get well-defined clusters from the data?

I have TV watching data and I have been trying to cluster it to get different sets of watchers. My dataset consists of 64 features (such as total watching time, percent of ads skipped, movies vs. shows, etc.). All the variables are either numerical or binary. But no matter how I treat them (normalize them, standardize them, leave them as is, take a subset of features, etc.), I always end up getting pictures similar to this: This particular picture was constructed …
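One way to quantify whether any blobs are real rather than eyeballing 2-D projections (a sketch, with a hypothetical preprocessed X): compute silhouette scores across candidate cluster counts.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    # Scores near 0 suggest there are no well-separated clusters at any k.
    print(k, silhouette_score(X, labels))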
Category: Data Science

Using PCA for Dimensionality Expansion

I was trying to use the t-SNE algorithm for dimensionality expansion, and I know this was not the primary usage of the algorithm and is not recommended. I saw an implementation here. I am not convinced by this implementation of t-SNE. The algorithm works like this: given a training dataset and a test dataset, combine the two together into one full dataset; run t-SNE on the full dataset (excluding the target variable); take the output of the t-SNE and add it as …
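A sketch of the procedure as described, with the caveat that fitting t-SNE on train and test together leaks test information into the features (t-SNE has no transform for unseen data, which is why that implementation combines them). X_train and X_test are hypothetical frames without the target column.

import numpy as np
from sklearn.manifold import TSNE

# Combine train and test, embed, then split the embedding back apart.
full = np.vstack([X_train, X_test])
embedding = TSNE(n_components=2, random_state=0).fit_transform(full)

# Append the embedding coordinates as extra features.
X_train_aug = np.hstack([X_train, embedding[:len(X_train)]])
X_test_aug = np.hstack([X_test, embedding[len(X_train):]])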
Category: Data Science
