PCA: Better performance with 300 components rather than 400 components. Why?

I am building a content-based image retrieval system. I extract feature maps of size 1024x1x1 using a backbone network, then apply PCA to the extracted features in order to reduce their dimensionality, using either nb_components=300 or nb_components=400. I obtained these results (dim_pca means no PCA applied). Is there any explanation for why k=300 works better than k=400? If I understand correctly, k=400 is supposed to explain more variance than k=300. Is it my mistake or …
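More components always explain at least as much variance, but retrieval quality can still peak below that, since the lowest-variance directions often carry noise rather than signal. A minimal sketch for inspecting the cumulative explained variance, assuming the 1024-d features are stacked in an array X (a random stand-in here):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(5000, 1024)  # stand-in for the real feature matrix

    pca = PCA(n_components=400).fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    print(f"variance explained by 300 components: {cum[299]:.3f}")
    print(f"variance explained by 400 components: {cum[399]:.3f}")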
Category: Data Science

Multidimensional scaling producing different results for different seeds

I took the data from here and wanted to play around with multidimensional scaling on this data. The data looks like this: In particular, I want to plot the cities in a 2D space and see how well it matches their real locations on a geographic map, using only the information about how far they are from each other, without any explicit latitude and longitude information. This is my code: import pandas as pd import numpy as np from sklearn …
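MDS embeddings are only determined up to rotation, reflection, and translation, and scikit-learn's solver starts from a random initialization, so different seeds naturally produce different layouts. A minimal sketch, assuming a symmetric city-to-city distance matrix D (built from random stand-in coordinates here); fixing random_state makes runs reproducible:

    import numpy as np
    from sklearn.manifold import MDS

    rng = np.random.default_rng(0)
    pts = rng.random((10, 2))  # stand-in city coordinates
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=42)
    coords = mds.fit_transform(D)  # identical across runs with a fixed seed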
Category: Data Science

What is the Purpose of Feature Selection?

I have a small medical dataset (200 samples) that contains only 6 cases of the condition I am trying to predict using machine learning. So far, the dataset is not proving useful for predicting the target variable; models achieve 0% recall and precision, probably because the dataset is so small. However, in order to learn from the dataset, I applied feature selection techniques to deduce which features are useful in predicting the target variable and …
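With only 6 positives out of 200, any feature scores need stratified cross-validation to mean anything. A hedged sketch with univariate selection inside a pipeline, on synthetic stand-in data with a similar class balance:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    # ~3% positives, roughly matching the 6-in-200 setup described above
    X, y = make_classification(n_samples=200, n_features=20, weights=[0.97],
                               random_state=0)

    pipe = make_pipeline(SelectKBest(f_classif, k=5),
                         LogisticRegression(class_weight="balanced", max_iter=1000))
    scores = cross_val_score(pipe, X, y, scoring="recall",
                             cv=StratifiedKFold(n_splits=3))
    print(scores.mean())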
Category: Data Science

Which algorithm can be used to reduce the dimension of multiple time series?

In my dataset, a data point is essentially a time series of 6 features per month over a year, so in all it results in 6*12=72 features. I need to find class outliers, so I perform dimensionality reduction, hoping the differences in the data are maintained, and then apply k-means clustering and compute distances. For dimensionality reduction I have tried PCA and a simple autoencoder to reduce the dimension from 72 to 6, but the results are unsatisfactory. Can anyone please suggest any other …
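One nonlinear alternative worth trying is kernel PCA; a minimal sketch of the 72-to-6 reduction followed by k-means, on random stand-in data (the RBF kernel and cluster count are assumptions, not recommendations):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import KernelPCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(1000, 72)  # stand-in for the monthly series
    Xs = StandardScaler().fit_transform(X)

    Z = KernelPCA(n_components=6, kernel="rbf").fit_transform(Xs)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)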
Category: Data Science

Why Do a Set of 3 Clusters Across 1 Dimension and a Set of 3 Clusters Across 2 Dimensions Form 9 Apparent Clusters in 3 Dimensions?

I am sorry if this is a well-known phenomenon, but I can't quite wrap my head around it. I have a related question: How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?. There are good answers there for feature selection and cluster metrics, but I think this phenomenon deserves special attention. I simulated 3 clusters along 1 dimension, then simulated 3 clusters along 2 dimensions, and then combined them into a dataset …
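The 9 apparent clusters are the Cartesian product of the two structures: each of the 3 positions in the first subspace can combine with each of the 3 positions in the other, giving 3 x 3 = 9 distinct centers in the joined 3-D space. A small simulation of the effect:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    # 3 cluster centers along dimension x
    x = rng.choice([0.0, 5.0, 10.0], size=n) + rng.normal(0, 0.3, n)
    # 3 cluster centers in the (y, z) plane, chosen independently of x
    yz_centers = np.array([[0, 0], [5, 5], [10, 0]])
    yz = yz_centers[rng.integers(0, 3, size=n)] + rng.normal(0, 0.3, (n, 2))

    data = np.column_stack([x, yz])  # each row lands near one of 9 centers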
Category: Data Science

How to Approach a Linear Machine-Learning Model When Input Variables Are Inconsistent

Disclaimer: I'm relatively new to the data science and ML world and still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but I am hitting a roadblock when it comes to my input data. The dataset consists of a few key input criteria, [FLOW, TEMP, PRESSURE, VOLTAGE_A], and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …
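For concreteness, a hedged baseline sketch of the setup described above; the column names come from the question, while the data and model choice are stand-ins:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((500, 5)),  # stand-in for the real dataset
                      columns=["FLOW", "TEMP", "PRESSURE", "VOLTAGE_A", "VOLTAGE_B"])

    X = df[["FLOW", "TEMP", "PRESSURE", "VOLTAGE_A"]]
    y = df["VOLTAGE_B"]
    model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)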
Category: Data Science

Accuracy drops when adding a fully connected layer for dimensionality reduction to a ResNet50

I'm training a ResNet50 for image classification and I'm interested in decreasing the dimensionality of the embedding layer, in order to apply some clustering techniques. The suggested dimension is somewhere in the range 64-256, so I thought I'd start from 128. I'm using PyTorch. After loading the pretrained ResNet50 from the official release I would usually do this: model = t.load(cfg.resnet_path) model.fc = nn.Sequential(nn.Linear(in_features=2048, out_features=num_classes, bias=True)) Everything worked and I reached an accuracy of …
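A hedged sketch of what a head with a 128-d bottleneck might look like; num_classes and the weight loading are placeholders for the question's own checkpoint code, and the ReLU between the two linear layers is one common choice, not the only one:

    import torch.nn as nn
    from torchvision import models

    num_classes = 10                       # assumption for illustration
    model = models.resnet50(weights=None)  # the question loads its own checkpoint
    model.fc = nn.Sequential(
        nn.Linear(2048, 128),              # 128-d embedding for clustering
        nn.ReLU(),
        nn.Linear(128, num_classes),
    )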
Category: Data Science

An autoencoder setup for anomaly detection

I am doing anomaly detection using machine learning. I have tried different models such as isolation forest, SVM, and KNN. The maximum accuracy I can get from each of them is $80\%$ on my dataset, which contains $5$ features and $4000$ data samples, $18\%$ of which are anomalous. When I use an autoencoder and adjust the reconstruction loss threshold properly, I can get $92\%$ accuracy, but the hidden layer setup of the autoencoder does not seem right despite the …
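A minimal PyTorch sketch for 5 input features; the layer widths are assumptions rather than a prescription, and anomalies are flagged by per-sample reconstruction error:

    import torch
    import torch.nn as nn

    class AE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 2))
            self.decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 5))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = AE()
    x = torch.rand(4000, 5)                  # stand-in for the real samples
    err = ((model(x) - x) ** 2).mean(dim=1)  # per-sample reconstruction loss
    flags = err > err.quantile(0.82)         # ~18% anomalous, per the question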
Category: Data Science

Illustrating the dimensionality reduction done by a classification or regression model

TL;DR: You can predict something, but how do you explain the prediction? Your usual classification/regression setup: let's say the data is a classic regression/classification problem, with several numerical columns, several nominal columns, and an event we are trying to predict:
user1, age:18, wealth:20000, likes:tomatoes, isInBigCity:yes, hasClicked:yes
user2, age:25, wealth:24000, likes:carrots, isInBigCity:no, hasClicked:no
...
With the help of random forests, SVMs, logistic regression, a deep neural network, or some other method, we export a model that can output a probability …
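One common starting point for explaining such a model is to inspect which columns drive its predictions; random forests, named above, expose impurity-based feature importances directly. A hedged sketch on synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    # higher importance = the column contributes more to the model's splits
    for i, imp in enumerate(clf.feature_importances_):
        print(f"feature {i}: {imp:.3f}")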
Category: Data Science

Dimensionality Reduction of Curved Structural Data

I have been using PCA for dimensionality reduction on datasets that are quite linear, and now I am tasked with the same on datasets that are largely curved in space. Imagine a noisy sine wave, for simplicity. Is PCA still useful in this scenario? If not, what is a more appropriate dimensionality reduction method?
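On strongly curved data, plain PCA only rotates the cloud, while a nonlinear method such as kernel PCA (or Isomap) can unfold it. A minimal sketch on the noisy sine wave mentioned above; the RBF kernel and gamma value are illustrative assumptions:

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 500)
    X = np.column_stack([t, np.sin(t)]) + rng.normal(0, 0.1, (500, 2))

    z_linear = PCA(n_components=1).fit_transform(X)   # best straight line only
    z_kernel = KernelPCA(n_components=1, kernel="rbf", gamma=0.1).fit_transform(X)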
Category: Data Science

Single scalar from vector

I am aware that this question is very general, but I found this question and it made me curious. What are the sensible ways you can think of to derive a single scalar value from a vector? Of course this procedure will vary a lot according to your data and your purpose, and it will result in information loss, but what are the alternatives? For now, this is what I have (from the linked question and my own): Length. Compute the …
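For concreteness, a few of the usual vector-to-scalar reductions in NumPy:

    import numpy as np

    v = np.array([3.0, -4.0, 12.0])
    length = np.linalg.norm(v)  # Euclidean length (L2 norm): 13.0
    l1 = np.abs(v).sum()        # L1 norm: 19.0
    mean = v.mean()             # average component
    peak = np.abs(v).max()      # largest magnitude (L-infinity norm)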
Category: Data Science

Clustering Algorithm + Euclidean Distance to find similarities

Goal: Create a tool that recommends similar players based on their statistical profile. Process: (1) standardize the data; (2) UMAP to reduce dimensionality (c. 50 features); (3) first-stage clustering: GMM to create macro clusters of players; (4) second-stage clustering: GMM to create micro clusters within each macro cluster, using different features based on position (e.g. only the 10/50 that are relevant); (5) calculate Euclidean distance using PCA (UMAP led to weird results). Question: How good/reasonable is this approach on a scale …
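A hedged sketch of steps (1)-(3); it assumes the umap-learn package and a matrix X of roughly 50 player statistics (random stand-ins here), with the component and cluster counts as placeholders:

    import numpy as np
    import umap
    from sklearn.mixture import GaussianMixture
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(2000, 50)  # stand-in for the player statistics
    Xs = StandardScaler().fit_transform(X)

    emb = umap.UMAP(n_components=10, random_state=0).fit_transform(Xs)
    macro = GaussianMixture(n_components=8, random_state=0).fit_predict(emb)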
Category: Data Science

Guidance needed with dimension reduction for clustering - some numerical, lots of categorical data

I have my data in a Pandas df with 25,000 rows and 1,500 columns, without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are columns with binary values that originated from columns with categorical data (I used pd.get_dummies() for this). Now I'd like to reduce the dimensions. I have already been running from sklearn.decomposition import PCA pca = PCA(n_components=2) pca.fit(df) for three hours, and I asked myself whether my approach was correct. I also …
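With roughly 1,470 binary dummy columns, the matrix is mostly zeros, so one hedged alternative is to keep it sparse and use TruncatedSVD, which accepts sparse input directly and is typically far faster than dense PCA:

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    # sparse stand-in for the dummy matrix (25,000 x 1,500, ~2% nonzero)
    X = sp.random(25000, 1500, density=0.02, format="csr", random_state=0)
    Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)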
Category: Data Science

Encoding very large dataset to one-hot encoding matrix

I have a text corpus dataset in which there are around 400 unique characters. The maximum row length is 3000 and we have 20000 rows, so we would have roughly a $20000\times3000\times400$ one-hot encoding matrix, which leads to a memory error, as the required size exceeds 900 GB of RAM. There are dimensionality reduction techniques such as PCA and others, but beyond those, what would you recommend in my case to overcome this issue? The text is …
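One hedged way around this is to never materialize the full matrix: store integer-encoded characters and expand only one batch at a time (inside a network, an embedding layer instead of one-hot input achieves the same thing):

    import numpy as np

    n_rows, seq_len, n_chars = 20000, 3000, 400
    # int16 codes: ~120 MB instead of the full one-hot tensor
    codes = np.random.randint(0, n_chars, (n_rows, seq_len), dtype=np.int16)

    def one_hot_batch(batch):
        # (b, seq_len) integer codes -> (b, seq_len, n_chars) one-hot floats
        return np.eye(n_chars, dtype=np.float32)[batch]

    first = one_hot_batch(codes[:32])  # ~150 MB per 32-row batch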
Category: Data Science

Need suggestions on customer segmentation

I have been tasked with performing customer segmentation for a business-to-business use case based on customer purchase history. Can experts provide me with input on how to proceed with customer segmentation based on the following dataset? Dataset details which have been provided to me: Hierarchy 3, 4, 5 define the categories under which the product falls. Edit: I also need input on how to select features for my clustering algorithm.
Category: Data Science

How to choose Recursive Feature Elimination parameters

In my project I have >900 features, and I thought of using the Recursive Feature Elimination algorithm to reduce the dimensionality of my problem (in order to improve accuracy). But I can't figure out how to choose the RFE parameters (the estimator and the number of features to select). Should I use model selection techniques in this case as well? Do you have any advice?
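scikit-learn's RFECV answers the second question directly: it picks the number of features by cross-validation, leaving only the estimator to choose up front. A minimal sketch on stand-in data (50 features here for speed, not the real 900):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)
    selector = RFECV(LogisticRegression(max_iter=1000), step=5, cv=5).fit(X, y)
    print(selector.n_features_)  # feature count chosen by cross-validation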
Category: Data Science

How to deal with disconnected components in isomap?

While creating a nearest-neighbor graph for Isomap, there is a possibility that the graph is disconnected. In this case, finding graph distances between all pairs of points is not possible. Are there any simple methods other than iteratively changing the nearest-neighbor search parameters until we get a connected graph?
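For reference, a minimal sketch of the iterative baseline the question wants to avoid, checking the component count of the k-NN graph with SciPy and raising k until the graph is connected (random stand-in data):

    import numpy as np
    from scipy.sparse.csgraph import connected_components
    from sklearn.neighbors import kneighbors_graph

    X = np.random.rand(200, 5)  # stand-in data
    for k in range(2, 20):
        g = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
        n_comp, _ = connected_components(g, directed=False)
        if n_comp == 1:
            print("graph connected at k =", k)
            break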
Category: Data Science

Using PCA for Dimensionality Expansion

I was trying to use the t-SNE algorithm for dimensionality reduction, and I know this is not the primary usage of the algorithm and is not recommended. I saw an implementation here, but I am not convinced by it. The algorithm works like this: given a training dataset and a test dataset, combine the two into one full dataset; run t-SNE on the full dataset (excluding the target variable); take the output of the t-SNE and add it as …
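A hedged sketch of the procedure as described; note that scikit-learn's TSNE has no transform() for unseen data, which is why the implementation combines train and test (and why this leaks test information into the features):

    import numpy as np
    from sklearn.manifold import TSNE

    X_train = np.random.rand(300, 20)  # stand-ins for the real sets
    X_test = np.random.rand(100, 20)

    X_full = np.vstack([X_train, X_test])
    emb = TSNE(n_components=2, random_state=0).fit_transform(X_full)

    X_train_aug = np.hstack([X_train, emb[:300]])  # t-SNE outputs as new features
    X_test_aug = np.hstack([X_test, emb[300:]])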
Category: Data Science
