PCA: Better performance with 300 components rather than 400 components. Why?

I am building a content-based image retrieval system. I extract feature maps of size 1024x1x1 using a backbone network, then apply PCA to the extracted features in order to reduce their dimensionality, using either nb_components=300 or nb_components=400. I obtained these results (dim_pca means no PCA applied). Is there any explanation for why k=300 works better than k=400? If I understand correctly, k=400 is supposed to explain more variance than k=300. Is it my mistake or …
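More components always explain at least as much variance, but retrieval quality can still peak below that, since the lowest-variance directions often carry noise rather than signal. A minimal sketch for inspecting the cumulative explained variance, assuming the 1024-d features are stacked in an array X (a random stand-in here):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(5000, 1024)  # stand-in for the real feature matrix

    pca = PCA(n_components=400).fit(X)
    cum = np.cumsum(pca.explained_variance_ratio_)
    print(f"variance explained by 300 components: {cum[299]:.3f}")
    print(f"variance explained by 400 components: {cum[399]:.3f}")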
Category: Data Science

Multidimensional scaling producing different results for different seeds

I took the data from here and wanted to play around with multidimensional scaling on this data. The data looks like this: In particular, I want to plot the cities in a 2D space and see how well it matches their real locations on a geographic map, using only the information about how far they are from each other, without any explicit latitude and longitude information. This is my code: import pandas as pd import numpy as np from sklearn …
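MDS embeddings are only determined up to rotation, reflection, and translation, and scikit-learn's solver starts from a random initialization, so different seeds naturally produce different layouts. A minimal sketch, assuming a symmetric city-to-city distance matrix D (built from random stand-in coordinates here); fixing random_state makes runs reproducible:

    import numpy as np
    from sklearn.manifold import MDS

    rng = np.random.default_rng(0)
    pts = rng.random((10, 2))  # stand-in city coordinates
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=42)
    coords = mds.fit_transform(D)  # identical across runs with a fixed seed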
Category: Data Science

What is the Purpose of Feature Selection?

I have a small medical dataset (200 samples) that contains only 6 cases of the condition I am trying to predict using machine learning. So far, the dataset is not proving useful for predicting the target variable; models achieve 0% recall and precision, probably because the dataset is so small. However, in order to learn from the dataset, I applied feature selection techniques to deduce which features are useful in predicting the target variable and …
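With only 6 positives out of 200, any feature scores need stratified cross-validation to mean anything. A hedged sketch with univariate selection inside a pipeline, on synthetic stand-in data with a similar class balance:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    # ~3% positives, roughly matching the 6-in-200 setup described above
    X, y = make_classification(n_samples=200, n_features=20, weights=[0.97],
                               random_state=0)

    pipe = make_pipeline(SelectKBest(f_classif, k=5),
                         LogisticRegression(class_weight="balanced", max_iter=1000))
    scores = cross_val_score(pipe, X, y, scoring="recall",
                             cv=StratifiedKFold(n_splits=3))
    print(scores.mean())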
Category: Data Science

Which algorithm can be used to reduce the dimension of multiple time series?

In my dataset, a data point is essentially a time series of 6 features per month over a year, so in all it results in 6*12=72 features. I need to find class outliers, so I perform dimensionality reduction, hoping the differences in the data are maintained, and then apply k-means clustering and compute distances. For dimensionality reduction I have tried PCA and a simple autoencoder to reduce the dimension from 72 to 6, but the results are unsatisfactory. Can anyone please suggest any other …
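One nonlinear alternative worth trying is kernel PCA; a minimal sketch of the 72-to-6 reduction followed by k-means, on random stand-in data (the RBF kernel and cluster count are assumptions, not recommendations):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import KernelPCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(1000, 72)  # stand-in for the monthly series
    Xs = StandardScaler().fit_transform(X)

    Z = KernelPCA(n_components=6, kernel="rbf").fit_transform(Xs)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)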
Category: Data Science

Why Do a Set of 3 Clusters Across 1 Dimension and a Set of 3 Clusters Across 2 Dimensions Form 9 Apparent Clusters in 3 Dimensions?

I am sorry if this is a well-known phenomenon, but I can't quite wrap my head around it. I have a related question: How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?. There are good answers there for feature selection and cluster metrics, but I think this phenomenon deserves special attention. I simulated 3 clusters along 1 dimension, then simulated 3 clusters along 2 dimensions, and then combined them into a dataset …
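The 9 apparent clusters are the Cartesian product of the two structures: each of the 3 positions in the first subspace can combine with each of the 3 positions in the other, giving 3 x 3 = 9 distinct centers in the joined 3-D space. A small simulation of the effect:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    # 3 cluster centers along dimension x
    x = rng.choice([0.0, 5.0, 10.0], size=n) + rng.normal(0, 0.3, n)
    # 3 cluster centers in the (y, z) plane, chosen independently of x
    yz_centers = np.array([[0, 0], [5, 5], [10, 0]])
    yz = yz_centers[rng.integers(0, 3, size=n)] + rng.normal(0, 0.3, (n, 2))

    data = np.column_stack([x, yz])  # each row lands near one of 9 centers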
Category: Data Science

How to Approach a Linear Machine-Learning Model When Input Variables Are Inconsistent

Disclaimer: I'm relatively new to the data science and ML world and still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but I am hitting a roadblock when it comes to my input data. The dataset consists of a few key input criteria, [FLOW, TEMP, PRESSURE, VOLTAGE_A], and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …
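For concreteness, a hedged baseline sketch of the setup described above; the column names come from the question, while the data and model choice are stand-ins:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((500, 5)),  # stand-in for the real dataset
                      columns=["FLOW", "TEMP", "PRESSURE", "VOLTAGE_A", "VOLTAGE_B"])

    X = df[["FLOW", "TEMP", "PRESSURE", "VOLTAGE_A"]]
    y = df["VOLTAGE_B"]
    model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)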
Category: Data Science

Accuracy drops when adding a fully connected layer for dimensionality reduction to a ResNet50

I'm training a ResNet50 for image classification and I'm interested in decreasing the dimensionality of the embedding layer, in order to apply some clustering techniques. The suggested dimension is somewhere in the range 64-256, so I thought I'd start from 128. I'm using PyTorch. After loading the pretrained ResNet50 from the official release I would usually do this: model = t.load(cfg.resnet_path) model.fc = nn.Sequential(nn.Linear(in_features=2048, out_features=num_classes, bias=True)) Everything worked and I reached an accuracy of …
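A hedged sketch of what a head with a 128-d bottleneck might look like; num_classes and the weight loading are placeholders for the question's own checkpoint code, and the ReLU between the two linear layers is one common choice, not the only one:

    import torch.nn as nn
    from torchvision import models

    num_classes = 10                       # assumption for illustration
    model = models.resnet50(weights=None)  # the question loads its own checkpoint
    model.fc = nn.Sequential(
        nn.Linear(2048, 128),              # 128-d embedding for clustering
        nn.ReLU(),
        nn.Linear(128, num_classes),
    )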
Category: Data Science

An autoencoder setup for anomaly detection

I am doing anomaly detection using machine learning. I have tried different models such as isolation forest, SVM, and KNN. The maximum accuracy I can get from each of them is $80\%$ on my dataset, which contains $5$ features and $4000$ data samples, $18\%$ of which are anomalous. When I use an autoencoder and adjust the reconstruction loss threshold properly, I can get $92\%$ accuracy, but the hidden layer setup of the autoencoder does not seem right despite the …
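A minimal PyTorch sketch for 5 input features; the layer widths are assumptions rather than a prescription, and anomalies are flagged by per-sample reconstruction error:

    import torch
    import torch.nn as nn

    class AE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 2))
            self.decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 5))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = AE()
    x = torch.rand(4000, 5)                  # stand-in for the real samples
    err = ((model(x) - x) ** 2).mean(dim=1)  # per-sample reconstruction loss
    flags = err > err.quantile(0.82)         # ~18% anomalous, per the question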
Category: Data Science

Illustrating the dimensionality reduction done by a classification or regression model

TL;DR: You can predict something, but how do you explain the prediction? Your usual classification/regression setup: let's say the data is a classic regression/classification problem, with several numerical columns, several nominal columns, and an event we are trying to predict:
user1, age:18, wealth:20000, likes:tomatoes, isInBigCity:yes, hasClicked:yes
user2, age:25, wealth:24000, likes:carrots, isInBigCity:no, hasClicked:no
...
With the help of random forests, SVMs, logistic regression, a deep neural network, or some other method, we export a model that can output a probability …
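One common starting point for explaining such a model is to inspect which columns drive its predictions; random forests, named above, expose impurity-based feature importances directly. A hedged sketch on synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X, y)

    # higher importance = the column contributes more to the model's splits
    for i, imp in enumerate(clf.feature_importances_):
        print(f"feature {i}: {imp:.3f}")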
Category: Data Science

Dimensionality Reduction of Curved Structural Data

I have been using PCA for dimensionality reduction on datasets that are quite linear, and now I am tasked with the same on datasets that are largely curved in space. Imagine a noisy sine wave, for simplicity. Is PCA still useful in this scenario? If not, what is a more appropriate dimensionality reduction method?
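On strongly curved data, plain PCA only rotates the cloud, while a nonlinear method such as kernel PCA (or Isomap) can unfold it. A minimal sketch on the noisy sine wave mentioned above; the RBF kernel and gamma value are illustrative assumptions:

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 500)
    X = np.column_stack([t, np.sin(t)]) + rng.normal(0, 0.1, (500, 2))

    z_linear = PCA(n_components=1).fit_transform(X)   # best straight line only
    z_kernel = KernelPCA(n_components=1, kernel="rbf", gamma=0.1).fit_transform(X)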
Category: Data Science

Single scalar from vector

I am aware that this question is very general, but I found this question and it made me curious. What are the sensible ways you can think of to derive a single scalar value from a vector? Of course this procedure will vary a lot according to your data and your purpose, and it will result in information loss, but what are the alternatives? For now, this is what I have (from the linked question and my own): Length. Compute the …
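For concreteness, a few of the usual vector-to-scalar reductions in NumPy:

    import numpy as np

    v = np.array([3.0, -4.0, 12.0])
    length = np.linalg.norm(v)  # Euclidean length (L2 norm): 13.0
    l1 = np.abs(v).sum()        # L1 norm: 19.0
    mean = v.mean()             # average component
    peak = np.abs(v).max()      # largest magnitude (L-infinity norm)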
Category: Data Science

Clustering Algorithm + Euclidean Distance to find similarities

Goal: Create a tool that recommends similar players based on their statistical profile. Process: (1) standardize the data; (2) UMAP to reduce dimensionality (c. 50 features); (3) first-stage clustering: GMM to create macro clusters of players; (4) second-stage clustering: GMM to create micro clusters within each macro cluster, using different features based on position (e.g. only the 10/50 that are relevant); (5) calculate Euclidean distance using PCA (UMAP led to weird results). Question: How good/reasonable is this approach on a scale …
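A hedged sketch of steps (1)-(3); it assumes the umap-learn package and a matrix X of roughly 50 player statistics (random stand-ins here), with the component and cluster counts as placeholders:

    import numpy as np
    import umap
    from sklearn.mixture import GaussianMixture
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(2000, 50)  # stand-in for the player statistics
    Xs = StandardScaler().fit_transform(X)

    emb = umap.UMAP(n_components=10, random_state=0).fit_transform(Xs)
    macro = GaussianMixture(n_components=8, random_state=0).fit_predict(emb)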
Category: Data Science

Guidance needed with dimension reduction for clustering - some numerical, lots of categorical data

I have my data in a Pandas df with 25,000 rows and 1,500 columns, without any NaNs. About 30 of the columns contain numerical data, which I standardized with StandardScaler(). The rest are columns with binary values that originated from columns with categorical data (I used pd.get_dummies() for this). Now I'd like to reduce the dimensions. I have already been running from sklearn.decomposition import PCA pca = PCA(n_components=2) pca.fit(df) for three hours, and I asked myself whether my approach was correct. I also …
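With roughly 1,470 binary dummy columns, the matrix is mostly zeros, so one hedged alternative is to keep it sparse and use TruncatedSVD, which accepts sparse input directly and is typically far faster than dense PCA:

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    # sparse stand-in for the dummy matrix (25,000 x 1,500, ~2% nonzero)
    X = sp.random(25000, 1500, density=0.02, format="csr", random_state=0)
    Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)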
Category: Data Science

Encoding very large dataset to one-hot encoding matrix

I have a text corpus dataset in which there are around 400 unique characters. The maximum row length is 3000 and we have 20000 rows, so we would have roughly a $20000\times3000\times400$ one-hot encoding matrix, which leads to a memory error, as the required size exceeds 900 GB of RAM. There are dimensionality reduction techniques such as PCA and others, but beyond those, what would you recommend in my case to overcome this issue? The text is …
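One hedged way around this is to never materialize the full matrix: store integer-encoded characters and expand only one batch at a time (inside a network, an embedding layer instead of one-hot input achieves the same thing):

    import numpy as np

    n_rows, seq_len, n_chars = 20000, 3000, 400
    # int16 codes: ~120 MB instead of the full one-hot tensor
    codes = np.random.randint(0, n_chars, (n_rows, seq_len), dtype=np.int16)

    def one_hot_batch(batch):
        # (b, seq_len) integer codes -> (b, seq_len, n_chars) one-hot floats
        return np.eye(n_chars, dtype=np.float32)[batch]

    first = one_hot_batch(codes[:32])  # ~150 MB per 32-row batch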
Category: Data Science

Need suggestions on customer segmentation

I have been tasked with performing customer segmentation for a business-to-business use case based on customer purchase history. Can experts provide me with input on how to proceed with customer segmentation based on the following dataset? Dataset details which have been provided to me: Hierarchy 3, 4, 5 define the categories under which the product falls. Edit: I also need input on how to select features for my clustering algorithm.
Category: Data Science

How to choose Recursive Feature Elimination parameters

In my project I have >900 features, and I thought of using the Recursive Feature Elimination algorithm to reduce the dimensionality of my problem (in order to improve accuracy). But I can't figure out how to choose the RFE parameters (the estimator and the number of features to select). Should I use model selection techniques in this case as well? Do you have any advice?
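scikit-learn's RFECV answers the second question directly: it picks the number of features by cross-validation, leaving only the estimator to choose up front. A minimal sketch on stand-in data (50 features here for speed, not the real 900):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)
    selector = RFECV(LogisticRegression(max_iter=1000), step=5, cv=5).fit(X, y)
    print(selector.n_features_)  # feature count chosen by cross-validation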
Category: Data Science

How to deal with disconnected components in isomap?

While creating a nearest-neighbor graph for Isomap, there is a possibility that the graph is disconnected. In this case, finding graph distances between all pairs of points is not possible. Are there any simple methods other than iteratively changing the nearest-neighbor search parameters until we get a connected graph?
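For reference, a minimal sketch of the iterative baseline the question wants to avoid, checking the component count of the k-NN graph with SciPy and raising k until the graph is connected (random stand-in data):

    import numpy as np
    from scipy.sparse.csgraph import connected_components
    from sklearn.neighbors import kneighbors_graph

    X = np.random.rand(200, 5)  # stand-in data
    for k in range(2, 20):
        g = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
        n_comp, _ = connected_components(g, directed=False)
        if n_comp == 1:
            print("graph connected at k =", k)
            break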
Category: Data Science

Using PCA for Dimensionality Expansion

I was trying to use the t-SNE algorithm for dimensionality reduction, and I know this is not the primary usage of the algorithm and is not recommended. I saw an implementation here, but I am not convinced by it. The algorithm works like this: given a training dataset and a test dataset, combine the two into one full dataset; run t-SNE on the full dataset (excluding the target variable); take the output of the t-SNE and add it as …
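A hedged sketch of the procedure as described; note that scikit-learn's TSNE has no transform() for unseen data, which is why the implementation combines train and test (and why this leaks test information into the features):

    import numpy as np
    from sklearn.manifold import TSNE

    X_train = np.random.rand(300, 20)  # stand-ins for the real sets
    X_test = np.random.rand(100, 20)

    X_full = np.vstack([X_train, X_test])
    emb = TSNE(n_components=2, random_state=0).fit_transform(X_full)

    X_train_aug = np.hstack([X_train, emb[:300]])  # t-SNE outputs as new features
    X_test_aug = np.hstack([X_test, emb[300:]])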
Category: Data Science
