I struggle with figuring out how to take PCA into production so I can test my models with unknown samples. I'm using both a one-hot encoding and a TF-IDF encoding in order to classify my elements with various models, mainly KNN. I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode the new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
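A minimal sketch of how such a fitted encoding + reduction + KNN stack could be kept together and reapplied to unseen samples, assuming placeholder column names "text" and "category" and toy data; TruncatedSVD is used here as the PCA-style step because the combined one-hot/TF-IDF matrix is sparse:

    import joblib
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    # Toy training data with placeholder column names
    X_train = pd.DataFrame({
        "text": ["red apple", "green apple", "ripe banana",
                 "yellow banana", "sweet orange", "sour orange"],
        "category": ["fruit", "fruit", "fruit", "fruit", "citrus", "citrus"],
    })
    y_train = ["apple", "apple", "banana", "banana", "orange", "orange"]

    encode = ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "text"),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ])

    pipe = Pipeline([
        ("encode", encode),
        ("reduce", TruncatedSVD(n_components=5)),     # PCA-style reduction that accepts sparse input
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])
    pipe.fit(X_train, y_train)

    # Persist the fitted pipeline; in production, load it and apply it unchanged to new samples
    joblib.dump(pipe, "encode_reduce_knn.joblib")
    model = joblib.load("encode_reduce_knn.joblib")
    print(model.predict(pd.DataFrame({"text": ["green banana"], "category": ["fruit"]})))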
I am learning PCA and my question is the following: can PCA be applied to a dataset containing both numerical continuous and numerical discrete variables? Thank you
In my dataset, a data point is essentially a time series of 6 features per month over a year, so in all it results in 6*12 = 72 features. I need to find class outliers, so I perform dimensionality reduction, hoping the differences in the data are preserved, and then apply k-means clustering and compute distances. For dimensionality reduction I have tried PCA and a simple autoencoder to reduce the dimension from 72 to 6, but the results are unsatisfactory. Can anyone please suggest any other …
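A rough sketch of that reduce-then-cluster-then-distance pipeline, with random placeholder data standing in for the 72 monthly features; the number of clusters and the 95th-percentile cut-off are illustrative choices only:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # X is assumed to be an (n_samples, 72) array: 6 features x 12 months per sample
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 72))  # placeholder data

    X_scaled = StandardScaler().fit_transform(X)
    X_reduced = PCA(n_components=6).fit_transform(X_scaled)

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_reduced)

    # Distance of each sample to its assigned cluster centre; large values flag candidate outliers
    dist_to_centre = np.linalg.norm(X_reduced - kmeans.cluster_centers_[kmeans.labels_], axis=1)
    threshold = np.percentile(dist_to_centre, 95)   # arbitrary cut-off for illustration
    outliers = np.where(dist_to_centre > threshold)[0]
    print(outliers)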
I have a dataset that has high collinearity among variables. When I created the linear regression model, I could not include more than five variables (I eliminated a feature whenever its VIF > 5). But I need to have all the variables in the model and find their relative importance. Is there any way around this? I was thinking about doing PCA and building models on the principal components. Does that help?
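One way this is often done is principal component regression: fit the regression on the components, then map the coefficients back to the original (standardized) variables. A minimal sketch with synthetic collinear data; the choice of 5 components is arbitrary here:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    # Synthetic data with strongly collinear columns
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(200, 5))
    y = X @ rng.normal(size=10) + rng.normal(size=200)

    pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
    pcr.fit(X, y)

    # Map the coefficients on the components back to the original (standardized) variables
    pca = pcr.named_steps["pca"]
    lr = pcr.named_steps["linearregression"]
    coef_original_scale = pca.components_.T @ lr.coef_
    print(coef_original_scale)   # one weight per original variable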
I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of the variables and calculated the percentages. Then I used the Orange software to run PCA. I uploaded the file and selected the columns and rows. And when it comes to PCA, the program asks me if I want to normalize …
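If it helps to see what that option changes, here is a small Python sketch, assuming Orange's "normalize" roughly corresponds to scaling each variable to unit variance, and using random placeholder percentages for the 15 books:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Placeholder: 15 books x 6 percentage variables
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(15, 6))

    # Without normalization, variables with larger variance dominate the components
    pca_raw = PCA(n_components=2).fit(X)
    print("raw variance ratio:", pca_raw.explained_variance_ratio_)

    # With normalization (each variable scaled to unit variance) all variables weigh equally
    X_std = StandardScaler().fit_transform(X)
    pca_std = PCA(n_components=2).fit(X_std)
    print("standardized variance ratio:", pca_std.explained_variance_ratio_)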
How can I change the legend? As you can see, the legend has some cluster numbers missing. How can I adjust the legend so that it shows all the cluster numbers (such as Cluster 1, Cluster 2, etc.; right now it only shows 0, 3, 6, 9)? (Code: I followed this link: Perform k-means clustering over multiple columns)

    kmeans = KMeans(n_clusters=10)
    y2 = kmeans.fit_predict(scaled_data)
    reduced_scaled_data = PCA(n_components=2).fit_transform(scaled_data)
    results = pd.DataFrame(reduced_scaled_data, columns=['pca1', 'pca2'])
    sns.scatterplot(x="pca1", y="pca2", hue=y2, data=results)  # y2 is my cluster number
    plt.title('K-means …
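The legend is abbreviated because seaborn treats a numeric hue as a continuous variable and shows only a few tick values. A self-contained sketch of one way to get every cluster listed, by passing string labels and legend="full" (placeholder data stands in for scaled_data):

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    scaled_data = rng.normal(size=(300, 8))   # placeholder for your scaled data

    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
    y2 = kmeans.fit_predict(scaled_data)

    reduced = PCA(n_components=2).fit_transform(scaled_data)
    results = pd.DataFrame(reduced, columns=["pca1", "pca2"])

    # Convert the labels to strings so seaborn treats them as categories,
    # and request a full legend so every cluster appears.
    results["cluster"] = ["Cluster %d" % c for c in y2]
    sns.scatterplot(x="pca1", y="pca2", hue="cluster", data=results,
                    hue_order=sorted(results["cluster"].unique()), legend="full")
    plt.title("K-means clusters in PCA space")
    plt.show()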
I'm trying to determine whether it's best to use linear or quadratic discriminant analysis for a problem I'm working on. It's my understanding that one of the motivations for using QDA over LDA is that it deals better with circumstances in which the variance of the predictors is not constant across the classes being predicted. This is true for my data; however, I intend to carry out principal components analysis beforehand. Because this PCA will involve scaling/normalising the variables, …
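For what it's worth, the two options can be compared directly by putting the scaling and PCA inside a pipeline and cross-validating both classifiers; a small sketch using the wine dataset as a stand-in and an arbitrary 5 components:

    from sklearn.datasets import load_wine
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.model_selection import cross_val_score

    X, y = load_wine(return_X_y=True)   # stand-in dataset

    lda_pipe = make_pipeline(StandardScaler(), PCA(n_components=5), LinearDiscriminantAnalysis())
    qda_pipe = make_pipeline(StandardScaler(), PCA(n_components=5), QuadraticDiscriminantAnalysis())

    # PCA scales and rotates the pooled data, but the per-class covariances of the
    # component scores can still differ, so LDA and QDA can still behave differently.
    print("LDA:", cross_val_score(lda_pipe, X, y, cv=5).mean())
    print("QDA:", cross_val_score(qda_pipe, X, y, cv=5).mean())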
I am looking for a public dataset of images that differ from each other only slightly, so that after applying PCA they can be reconstructed with a small error from very few PCA coefficients. It can be any type of images; the purpose is only to demonstrate an extreme example of PCA.
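As a placeholder until a better image set turns up, the digits data bundled with scikit-learn already shows the effect: the images are small and similar enough that a handful of components reconstructs them reasonably well. A sketch (the choice of 5 components is arbitrary):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    # load_digits is only a stand-in; any set of visually similar images works the same way
    X = load_digits().data              # (1797, 64) flattened 8x8 images
    pca = PCA(n_components=5).fit(X)    # keep only a handful of components

    X_compressed = pca.transform(X)             # 5 coefficients per image
    X_reconstructed = pca.inverse_transform(X_compressed)

    err = np.mean((X - X_reconstructed) ** 2)
    print("kept variance:", pca.explained_variance_ratio_.sum())
    print("mean squared reconstruction error:", err)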
I'm performing PCA on different time series and then using k-means clustering to try to group together common factors. The issue I'm facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
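One possible (not necessarily best) way to handle the inactive periods before PCA is to make every factor a full-length vector first, for example by filling the gaps; a sketch with placeholder data, where what to fill with (0, the mean, interpolation) depends on what "inactive" means for the data:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # ts is assumed to be a DataFrame of shape (n_factors, n_months), with NaN where
    # a factor is inactive; placeholder data here (40 factors x 144 months = 12 years)
    rng = np.random.default_rng(0)
    ts = pd.DataFrame(rng.normal(size=(40, 144)))
    ts.iloc[:10, :24] = np.nan          # some factors inactive for the first two years

    # Fill inactive periods so every factor has a full-length vector before PCA
    filled = ts.fillna(0.0)
    scores = PCA(n_components=5).fit_transform(filled.values)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
    print(labels)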
I'm new to Stack Exchange, so I am sorry if this is not the right way to ask a question. For my thesis I wish to propose a method for future research that uses PCA to cluster features (feature clustering) and then applies per-cluster PCA. I got the idea from this paper. But I have a hard time finding literature about PCA being used to cluster variables (rather than reduce variables). I could imagine that it is not …
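One common recipe for PCA-based feature clustering (not necessarily the one in the cited paper) is to describe each feature by its loadings on the leading components, cluster those loading vectors, and then run PCA within each feature cluster; a sketch with placeholder data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))            # placeholder data: 20 features
    X_std = StandardScaler().fit_transform(X)

    # Step 1: describe each feature by its loadings on the leading components
    global_pca = PCA(n_components=5).fit(X_std)
    loadings = global_pca.components_.T        # shape (n_features, 5): one row per feature

    # Step 2: cluster the features by the similarity of their loading patterns
    feature_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(loadings)

    # Step 3: run a separate PCA inside each feature cluster
    for k in np.unique(feature_labels):
        cols = np.where(feature_labels == k)[0]
        sub_pca = PCA(n_components=min(2, len(cols))).fit(X_std[:, cols])
        print(f"cluster {k}: features {cols.tolist()}, "
              f"variance kept {sub_pca.explained_variance_ratio_.sum():.2f}")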
I am trying to use PCA with scikit-learn for feature selection, and there is something about PCA that I am not understanding. Can someone please fill in the blanks for me? I have a normalised dataset with 11 components. The output of PCA is:

    ==================================================
    Explained Variance:
    [0.29673715 0.15425831 0.10136684 0.09121094 0.09012841 0.08089791 0.07294822 0.04842635 0.0290573 0.0249145 0.01005407]
    ==================================================
    Cumulative Explained Variance:
    [ 29.67371513 45.09954607 55.23623054 64.35732446 73.37016542 81.45995668 88.75477905 93.5974138 96.50314332 98.9945929 100. ]
    ==================================================

The graph of the …
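For reference, those two arrays are just explained_variance_ratio_ and its cumulative sum; a small sketch with placeholder data showing how they are produced and how a cut-off could be read from them:

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder for the normalised dataset with 11 columns
    rng = np.random.default_rng(0)
    X_norm = rng.normal(size=(500, 11))

    pca = PCA().fit(X_norm)

    explained = pca.explained_variance_ratio_      # fraction of variance per component
    cumulative = np.cumsum(explained) * 100        # running total in percent

    print("Explained Variance:", explained)
    print("Cumulative Explained Variance:", cumulative)

    # e.g. keep enough components to cover ~90% of the variance
    n_keep = np.searchsorted(cumulative, 90) + 1
    print("components needed for 90%:", n_keep)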
I'm new to statistics, so sorry for any major lack of knowledge in the topic; I'm just doing a project for graduation. I'm trying to cluster a health dataset containing diseases (3,456) and symptoms (25), grouping them by the number of events that occurred. My concern is that a lot of the values are 0 simply because some diseases didn't show that particular symptom, for example (I made up the values for now): So, I was wondering what was the best way to cluster this …
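One possible approach (just a sketch, not necessarily the best way) is to keep the disease-by-symptom counts as a sparse matrix, row-normalise them into symptom profiles so frequent diseases do not dominate, reduce with TruncatedSVD, and only then cluster; placeholder counts below:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import normalize
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    # Placeholder: diseases x symptoms matrix of event counts, mostly zeros
    rng = np.random.default_rng(0)
    counts = rng.poisson(0.2, size=(3456, 25))
    X = csr_matrix(counts)

    # Row-normalising turns raw counts into symptom profiles;
    # TruncatedSVD works directly on the sparse matrix.
    profiles = normalize(X, norm="l1")
    reduced = TruncatedSVD(n_components=10, random_state=0).fit_transform(profiles)
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)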
So I am performing k-means clustering on RFM variables (Recency, Frequency, Monetary). The RFM variables are in the form of quantiles (1-4). I used PCA and found the principal components. I then used the elbow method to find the optimal number of clusters, which I use in the k-means algorithm. Could anyone guide me on whether this is a correct method? Further, on the graph the clusters' axes range from -3 to 3, and I …
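A sketch of that sequence with placeholder quantile scores; the -3 to 3 range on the axes most likely just reflects that PCA centres the data (and, if the scores are standardized first, most values fall within about three standard deviations of zero):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Placeholder: (n_customers, 3) array of quantile scores 1-4
    rng = np.random.default_rng(0)
    rfm = rng.integers(1, 5, size=(1000, 3))

    rfm_std = StandardScaler().fit_transform(rfm)         # common, but optional, scaling step
    components = PCA(n_components=2).fit_transform(rfm_std)

    # Elbow method: plot inertia against k and look for the bend
    inertias = []
    ks = range(1, 11)
    for k in ks:
        inertias.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(components).inertia_)
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)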
Let's imagine that I have 3 points and they all lie on a sloped straight line, such as (-4, -2), (0, 0), (2, 1); this is a straight line passing through the origin. Intuitively, PCA2 would be 0, as I have no spread in the data perpendicular to the line, and PCA1 would be the maximum variance of the data along the line. Is my intuition correct that PCA2 = 0? Is there any intuitive way to calculate PCA1 for this scenario?
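The intuition can be checked numerically; a quick sketch with these three points:

    import numpy as np
    from sklearn.decomposition import PCA

    points = np.array([[-4, -2], [0, 0], [2, 1]])   # all on the line y = x/2

    pca = PCA(n_components=2).fit(points)
    print(pca.explained_variance_)         # second value is ~0: no spread off the line
    print(pca.explained_variance_ratio_)   # ~[1., 0.]: PCA1 carries all the variance

    # PCA1's variance is just the variance of the points projected onto the line's direction
    direction = np.array([2, 1]) / np.sqrt(5)
    projections = (points - points.mean(axis=0)) @ direction
    print(projections.var(ddof=1))         # matches pca.explained_variance_[0]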
I have been using PCA dimensionality reduction on datasets that are quite linear and now I am tasked with the same on datasets that are largely curved in space. Imagine a noisy sine wave for simplicity. Is PCA still useful in this scenario? If not, what is a more appropriate dimensionality reduction method?
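PCA still only finds a straight-line (linear) projection, so for data lying on a curve people often reach for nonlinear methods such as Kernel PCA or Isomap; a small comparison sketch on a noisy sine wave (the kernel and neighbour settings are arbitrary):

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.manifold import Isomap

    # Noisy sine wave: intrinsically 1-D, but curved in 2-D space
    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 500)
    X = np.column_stack([t, np.sin(t)]) + 0.05 * rng.normal(size=(500, 2))

    linear = PCA(n_components=1).fit_transform(X)                        # straight-line projection
    kpca = KernelPCA(n_components=1, kernel="rbf", gamma=0.5).fit_transform(X)
    iso = Isomap(n_components=1, n_neighbors=10).fit_transform(X)
    # PCA can only project onto a straight line; Kernel PCA and Isomap can follow the curve.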
I have my data in a Pandas df with 25,000 rows and 1,500 columns without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are columns with binary values that originated from columns with categorical data (I used pd.get_dummies() for this). Now I'd like to reduce the dimensions. I have already been running

    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(df)

for three hours, and I asked myself whether my approach was correct. I also …
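If the bottleneck is the PCA itself, two things that may help are the randomized SVD solver and IncrementalPCA, which processes the data in batches; a sketch with a smaller placeholder array (recent scikit-learn versions may already pick the randomized solver automatically for this shape):

    import numpy as np
    from sklearn.decomposition import PCA, IncrementalPCA

    # Smaller placeholder for df.values, just for illustration
    rng = np.random.default_rng(0)
    values = rng.normal(size=(2_500, 150))

    # Randomized solver: fast when only a few components are requested
    pca = PCA(n_components=2, svd_solver="randomized", random_state=0)
    scores = pca.fit_transform(values)

    # IncrementalPCA processes the data in batches if memory is the bottleneck
    ipca = IncrementalPCA(n_components=2, batch_size=1_000)
    scores_batched = ipca.fit_transform(values)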
I have the following problem: I have some sort of data (that I can't publish here, but they are in the form of points with XYZ coordinates) and I can represent them as a collection of graphs, i.e. $Q = \{G_1, G_2, \ldots, G_t\}$, where for every node there is an associated set of features, e.g. node $u_i$ has feature vector $\mathcal{F}_i$, and the features change between graphs (but the graph structure does not). The resulting graphs are big in …
I have TV watching data and I have been trying to cluster it to get different sets of watchers. My dataset consists of 64 features (such as total watching time, percent of ads skipped, movies vs. shows, etc.). All the variables are either numerical or binary. But no matter how I treat them (normalize them, standardize them, leave them as is, take a subset of features, etc.), I always end up getting pictures similar to this: This particular picture was constructed …
I was trying to use the t-SNE algorithm for dimensionality reduction, and I know this is not the primary usage of the algorithm and is not recommended. I saw an implementation here. I am not convinced by this implementation of t-SNE. The algorithm works like this: given a training dataset and a test dataset, 1) combine the two into one full dataset, 2) run t-SNE on the full dataset (excluding the target variable), 3) take the output of the t-SNE and add it as …
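The described procedure, written out as a sketch with placeholder DataFrames (note the caveat in the comments: t-SNE has no transform() for unseen points, which is presumably why the implementation embeds train and test together):

    import numpy as np
    import pandas as pd
    from sklearn.manifold import TSNE

    # Placeholder train/test data; train carries a "target" column
    rng = np.random.default_rng(0)
    train = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
    train["target"] = rng.integers(0, 2, size=200)
    test = pd.DataFrame(rng.normal(size=(50, 5)), columns=[f"f{i}" for i in range(5)])

    # 1) combine train and test into one full dataset (features only)
    full = pd.concat([train.drop(columns="target"), test], ignore_index=True)

    # 2) run t-SNE on the full dataset
    embedding = TSNE(n_components=2, random_state=0).fit_transform(full.values)

    # 3) add the t-SNE output back as extra features, split by the original row order
    train["tsne1"], train["tsne2"] = embedding[:len(train), 0], embedding[:len(train), 1]
    test["tsne1"], test["tsne2"] = embedding[len(train):, 0], embedding[len(train):, 1]
    # Caveat: t-SNE cannot embed new points into an existing map, so the whole
    # embedding must be recomputed whenever new data arrives.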