I am working with a data set that, besides customer age and income, gives the balance a customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, a value of 0 means the customer does not hold that account. There are 9800 customer observations, with roughly 6000 checking accounts and 4000 savings accounts; for the other account types there are fewer than 300 observations each. I have to use K-Means Clustering analysis …
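A minimal sketch of how such data could be prepared and clustered, assuming the balances live in a CSV; the file name and exact column labels are placeholders, not from the original post. Standardizing first keeps the large balance columns from dominating the Euclidean distances.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customers.csv")  # hypothetical file name
features = ["Age", "Income", "Checking", "Shares", "Investment",
            "Savings", "Deposit", "Mortgage", "Loan", "Certificates"]

# Income and balances are on very different scales, so standardize per column.
X = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_
print(df.groupby("cluster")[features].mean())  # inspect cluster profiles in original units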
I recently received a manuscript for review in which the author used ~1000 "fake" data points so that the final centroid of K-means stays within the required range. Neither I nor the author has a background in data science, and the paper is more of an application in our research area. I have tried to find published work related to such a method of restricting K-means centers, but failed to do so. However, on simple logic it seems like a valid approach, so …
I am using this dataset; the target column is the last one, 'DEATH_EVENT', which I have separated out. I am using KMeans and then counting the number of hits and misses. The result is quite bad, so I think I should delete some columns, or write a loop that drops them. What would you do?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X = np.genfromtxt('heart_failure_clinical_records_dataset.csv', delimiter=',')
X = np.delete(X, 0, 0)
train, test = train_test_split(X, …
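A minimal sketch of one way to count "hits": map each of the two K-means clusters to the majority DEATH_EVENT label inside it, then score. The column positions are assumptions based on the description (last column = target, first row = header), and scaling is added because the clinical features are on very different scales.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

data = np.genfromtxt('heart_failure_clinical_records_dataset.csv', delimiter=',')
data = np.delete(data, 0, 0)          # drop the header row
X, y = data[:, :-1], data[:, -1]      # separate the DEATH_EVENT column

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Cluster ids are arbitrary, so align each cluster with its majority class.
hits = 0
for c in np.unique(labels):
    mask = labels == c
    majority = np.bincount(y[mask].astype(int)).argmax()
    hits += np.sum(y[mask] == majority)
print(f"hits: {hits}, misses: {len(y) - hits}")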
I have a dataset which contains ~15 features. With the elbow method, I found out that the optimal number of clusters is probably four. Therefore, I applied the K-means algorithm with four clusters. Now, I would like to understand why these clusters have been formed the way they are. In other words, I would like to identify the shared properties of the points of a specific cluster. My idea is the following: Let's pretend that C1 are the coordinates of …
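A minimal sketch of one way to profile the clusters: compare each cluster's mean (per feature) against the overall mean, so large positive or negative entries show which features make a cluster distinctive. The data here is a stand-in generated with make_blobs, not the real 15-feature dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, n_features=15, centers=4, random_state=0)  # stand-in data
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(StandardScaler().fit_transform(X))

df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["cluster"] = labels
profile = df.groupby("cluster").mean() - df.drop(columns="cluster").mean()
print(profile.round(2))  # per-cluster deviation from the global mean, in original units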
When given two random points which are not instances in the dataset, should I include those centroids in my calculations for the new centroids? For example, in this link they use starting centroids that are part of the dataset to calculate the mean for the new centroids. But if given random x and y coordinates, let's say [2,1] and [3,2], which are not part of the dataset, do I also include them or just the instances in the …
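A minimal sketch of a single K-means update step with random starting centroids that are not data points: only the dataset instances assigned to a centroid are averaged, and the old centroid itself is never included in the mean. The toy instances are invented.

import numpy as np

data = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])  # dataset instances
centroids = np.array([[2.0, 1.0], [3.0, 2.0]])                     # the two random points

# Assign each instance to its nearest centroid.
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# New centroid = mean of the assigned instances only.
new_centroids = np.array([data[assignments == k].mean(axis=0)
                          for k in range(len(centroids))])
print(new_centroids)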
I already referred to these posts here and here. I also posted here, but since there has been no response, I am posting here. Currently, I am working on customer segmentation using purchase data, so my data has the below info for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units, etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
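A minimal sketch, with made-up column names, showing that standardization is applied per column: it removes the unit, but within each column the ordering and relative spacing of customers is preserved, so that information is not lost.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "total_spend":   [100.0, 2500.0, 800.0, 60.0],   # dollars
    "num_purchases": [2, 40, 12, 1],                  # counts
})
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled)  # each column now has mean 0 and unit variance, same ordering as before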
I have a list of songs, for each of which I have extracted a feature vector. I calculated a similarity score between each pair of vectors and stored this in a similarity matrix. I would like to cluster the songs based on this similarity matrix to try to identify clusters, or a sort of genre grouping. I have used the networkx package to create a force-directed graph from the similarity matrix, using the spring layout. Then I used KMeans clustering on the position of …
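A minimal sketch of the pipeline described above (force-directed layout from a similarity matrix, then K-means on the node positions); the similarity matrix here is random stand-in data, not the real song features.

import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sim = rng.random((30, 30))
sim = (sim + sim.T) / 2          # make it symmetric
np.fill_diagonal(sim, 1.0)

G = nx.from_numpy_array(sim)               # edge weights = similarities
pos = nx.spring_layout(G, weight="weight", seed=0)
coords = np.array([pos[i] for i in G.nodes()])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
print(labels)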
I'm trying to implement the unsupervised k-means algorithm for sentiment analysis of the IMDB movie dataset created by Stanford. The steps I followed are: 1) load the comments; 2) apply tokenization and stemming, and use TF-IDF to create the TF-IDF matrix; 3) use k-means to divide the data into 2 clusters. My problem is how to validate the clusters; I have labeled test data. I want to check if all the negative examples go in one cluster …
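A minimal sketch of validating two K-means clusters against labeled data: since cluster ids are arbitrary, try both id-to-sentiment mappings, or use a label-permutation-invariant score such as NMI. The documents and labels below are toy placeholders, not the IMDB data.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

docs = ["great movie, loved it", "terrible film, waste of time",
        "wonderful acting", "awful plot and boring"]
y_true = np.array([1, 0, 1, 0])   # 1 = positive, 0 = negative (toy labels)

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Try both possible cluster-to-label mappings and keep the better one.
acc = max(np.mean(clusters == y_true), np.mean((1 - clusters) == y_true))
print("best-mapping accuracy:", acc)
print("NMI:", normalized_mutual_info_score(y_true, clusters))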
I am trying to think through my process before doing any real coding; however, I got confused quickly. Say I have 100 instruments and I know their price movements every day for a year, so I can create a movement matrix

A = [[I1-1,   I2-1,   ....,  I100-1],     (I1-1 is the price for instrument 1 on day 1)
     [I1-2,   I2-2,   ....,  I100-2],
     ....
     [I1-365, I2-365, ....,  I100-365]]

Then for each instrument, I can calculate a price movement correlation with the other instruments for …
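A minimal sketch of that step: build the (days x instruments) movement matrix, compute the instrument-by-instrument correlation matrix, and turn it into a distance. Because K-means itself expects feature vectors rather than a distance matrix, the clustering step below swaps in hierarchical clustering on the precomputed distances as one option; the data is random stand-in data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
A = rng.normal(size=(365, 100))          # rows = days, columns = instruments

corr = np.corrcoef(A, rowvar=False)      # 100 x 100 correlation matrix
dist = 1.0 - corr                        # higher correlation -> smaller distance
np.fill_diagonal(dist, 0.0)

# Hierarchical clustering accepts a precomputed condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels))               # cluster sizes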
I see that the authors of this paper report F1 and NMI scores to measure clustering quality. However, I don't understand the algorithm by which they actually compute them. See the Evaluation section. Although I have looked at the code, I am not sure about the actual algorithm.
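A hedged sketch of two common conventions for these metrics; I cannot confirm this matches that paper's exact procedure. NMI comes straight from sklearn, and one widespread pairwise F1 treats every pair of points placed in the same cluster (or same ground-truth class) as a predicted (or true) positive pair.

import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # toy ground-truth classes
y_pred = np.array([1, 1, 0, 0, 0, 2])   # cluster ids are arbitrary

print("NMI:", normalized_mutual_info_score(y_true, y_pred))

pairs = list(combinations(range(len(y_true)), 2))
same_true = [int(y_true[i] == y_true[j]) for i, j in pairs]
same_pred = [int(y_pred[i] == y_pred[j]) for i, j in pairs]
print("pairwise F1:", f1_score(same_true, same_pred))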
Whether this is correct or not I am not able to judge, being myself in the early days of data science. However, I have applied K-means to a corpus to which some random documents (very short sentences) were added. These have been vectorized so as to be suitable. With the clustering results at hand, I was somehow expecting each vector (keyword) to fall into only one cluster at a time (and no more than that). This is not the case. In some circumstances, …
I have been exploring clustering algorithms (K-Means, K-Medoids, Ward Agglomerative, Gaussian Mixture Modeling, BIRCH, DBSCAN, OPTICS, Common Nearest-Neighbour Clustering) with multidimensional data. I believe that the clusters in my data occur across different subsets of the features rather than across all features, and I believe that this impacts the performance of the clustering algorithms. To illustrate, below is Python code for a simulated dataset:

## Simulate a dataset.
import numpy as np, matplotlib.pyplot as plt
from sklearn.cluster import KMeans …
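A minimal, self-contained sketch (not the simulation from the post) of the effect being described: clusters that are only separated in a subset of the features, plus noisy irrelevant features, degrade K-means unless only the informative subset is used.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 300
labels_true = rng.integers(0, 3, size=n)
informative = labels_true[:, None] * 5 + rng.normal(size=(n, 2))  # 2 informative features
noise = rng.normal(scale=10, size=(n, 8))                          # 8 irrelevant features
X = np.hstack([informative, noise])

for name, data in [("all 10 features", X), ("2 informative features", X[:, :2])]:
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    print(name, "ARI:", round(adjusted_rand_score(labels_true, pred), 3))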
I know what semantic segmentation is, and I know how to do semantic segmentation using deep learning, but my question here is: can I do semantic segmentation in a traditional way, such as with k-means or mean shift clustering? Here's what I tried to do:

import numpy as np
import cv2
from sklearn.cluster import MeanShift, estimate_bandwidth
#from skimage.color import rgb2lab

#Loading original image
originImg = cv2.imread('test/2019_00254.jpg')

# Shape of original image
originShape = originImg.shape

# Converting image into array of dimension [nb of …
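A minimal sketch of how that pipeline can be completed, assuming the same image path as the post: cluster the pixel colours with MeanShift and reshape the labels back to the image shape. This produces colour-based segments rather than true semantic classes, which is the usual limitation of the traditional approach.

import numpy as np
import cv2
from sklearn.cluster import MeanShift, estimate_bandwidth

img = cv2.imread('test/2019_00254.jpg')          # path taken from the post
flat = img.reshape((-1, 3)).astype(np.float32)   # rows = pixels, columns = B, G, R

bandwidth = estimate_bandwidth(flat, quantile=0.1, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(flat)

segmented = ms.labels_.reshape(img.shape[:2])    # one cluster id per pixel
print("number of segments:", ms.labels_.max() + 1)
cv2.imwrite('segmented.png', (segmented * (255 // max(1, segmented.max()))).astype(np.uint8))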
I'm looking to perform a k-means cluster analysis on a set of data whose variables span ranges with both positive and negative values. Given the ranges vary so much, the data will need to be scaled, but my concern is with the variables that contain negative values. Should I perform some sort of log transformation on all the data so as to shift it to positive values? For example: Variable A: 3.4, 5.6, 1.3, 7.6, 8.3; Variable B: 1, 2, 3, 2, 1; Variable …
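A minimal sketch showing that z-score standardization handles negative values directly, without any log transform (which would fail on non-positive entries). The first two columns reuse the example values above; the third, signed wide-range column is invented.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [3.4, 1.0, -250.0],
    [5.6, 2.0,  -30.0],
    [1.3, 3.0,  400.0],
    [7.6, 2.0,  120.0],
    [8.3, 1.0,  -75.0],
])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~0 for every column
print(X_scaled.std(axis=0).round(6))   # ~1 for every column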
I am doing k-means clustering on a dataset of selling values of articles; each article has 52 selling values (one per week). I am trying to automatically calculate the optimal number of clusters for any unknown dataset. I tried two criteria: the elbow method and the silhouette coefficient. For the silhouette coefficient, across 1 to 20 clusters I got values from 0.059 to 0.117, which is (in my opinion) extremely low (I have heard that around 0.7 is normal). For …
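A minimal sketch of scanning k with the silhouette coefficient (note it is only defined for k >= 2); the 52-week sales matrix here is random stand-in data, not the real article data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 52))   # rows = articles, columns = weekly sales

scores = {}
for k in range(2, 21):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k:", best_k, "silhouette:", round(scores[best_k], 3))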
I'm performing PCA on different time series and then using K-means clustering to try to group together common factors. The issue I'm facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
Below are two sets of code that do the same thing, one in Python and the other in R. They both graph the K-means result the same way with respect to PCA, but once I make the bar chart at the end using the cluster centers, the graphs are totally different. I believe there is something wrong with the K-means and the cluster-center calculation in Python. The original code was provided in R. I am trying to see why the bar chart in …
K-means clustering tries to minimize the within-cluster scatter and maximize the distances between clusters. It does so on all attributes. I am learning about this method on several datasets. To illustrate, in one of the datasets countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of countries. How should I deal with these attributes? As mentioned before, k-means tries to minimize the scatter …
I am looking for an unsupervised method that can also detect the points that start to look different from the majority. Which clustering techniques (I use Python) can be used for such data sets? I have tried k-means, but as I expected it fails considerably at detecting such peaks.
I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process, it is highly likely there are duplicate people caused by typos, and I am trying to cluster them using K-means. I know there are easier ways of doing this, but the reason I am doing it like this is out of curiosity. In order to vectorize each person, I am concatenating the strings as follows: [name][address][birthday] and then …
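A minimal sketch of one way to vectorize the concatenated [name][address][birthday] strings: character n-gram TF-IDF followed by K-means, so typo variants end up with very similar vectors. The records below are invented examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

records = [
    "John Smith12 Oak Street1990-04-01",
    "Jon Smith12 Oak Stret1990-04-01",     # typo variant of the same person
    "Mary Jones99 Elm Avenue1985-11-23",
]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(records)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the two near-duplicates should usually land in the same cluster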