Customer Segmentation: Should I use a variable, representing a product, that is unpopular in the dataset for K-Means Clustering?

I am working with a data set that, besides customer age and income, gives the balance a customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, 0 means that the customer does not have that account. There are 9800 customer observations, with roughly 6000 checking accounts and 4000 savings accounts. Each of the other account types has fewer than 300 observations. I have to use K-Means Clustering analysis …
Category: Data Science

Theoretical work on the validity of restricting movement of the centroid in K-Means

I recently received a manuscript for review in which the author used ~1000 "fake" data points so that the final k-means centroid stays within the required range. Neither the author nor I seem to have a background in data science, and the paper is more of an application to our research area. I have tried to find published work on this method of restricting k-means centers, but failed to do so. However, by simple logic, it seems like a valid approach, so …
Topic: k-means
Category: Data Science

How to improve the result? Should I remove the columns?

I am using this dataset. The target column is the last one, 'DEATH_EVENT', which I have separated out. I am using KMeans to calculate the number of hits and misses. The result is quite bad; I think I should delete some columns or write a loop that deletes them. What would you do?

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    X = np.genfromtxt('heart_failure_clinical_records_dataset.csv', delimiter=',')
    X = np.delete(X, 0, 0)
    train, test = train_test_split(X, …
Category: Data Science

Find the shared properties of cluster samples

I have a dataset which contains ~15 features. With the elbow method, I found out that the optimal number of clusters is probably four. Therefore, I applied the K-means algorithm with four clusters. Now, I would like to understand why these clusters have been formed the way they are. In other words, I would like to identify the shared properties of the points of a specific cluster. My idea is the following: Let's pretend that C1 are the coordinates of …
Category: Data Science

Calculating new centroids when the centroids are chosen at random

When given two random points which are not instances in the dataset, should I include the centroids in my calculations for the new centroids? For example, in this link they are using starting centroids which are part of the dataset to calculate the mean for the new centroids. But if given random x and y coordinates, let's say [2,1] and [3,2], which are not part of the dataset, do I also include them or just the instances in the …
Category: Data Science
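
The update step asked about above can be sketched in plain Python. In the standard k-means (Lloyd) update, each new centroid is the mean of the data points assigned to it, so a starting centroid that is not an instance of the dataset does not enter the mean. The four data points below are made up for illustration; only the starting centroids [2, 1] and [3, 2] come from the question, and `kmeans_step` is a hypothetical helper name.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute the centroids.

    The old centroids are used only to decide the assignment. Each new
    centroid is the mean of its assigned data points and nothing else.
    """
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)

    new_centroids = []
    for old, members in zip(centroids, clusters):
        if not members:
            # No point was assigned: keep the old position (one common choice).
            new_centroids.append(list(old))
            continue
        dims = len(members[0])
        new_centroids.append(
            [sum(m[d] for m in members) / len(members) for d in range(dims)]
        )
    return new_centroids, clusters


points = [[1, 1], [1, 2], [4, 2], [5, 3]]   # hypothetical data points
start = [[2, 1], [3, 2]]                    # starting centroids from the question
new_centroids, clusters = kmeans_step(points, start)
print(new_centroids)  # [[1.0, 1.5], [4.5, 2.5]]
```

If the starting centroids were drawn from the dataset itself, they would take part in the mean only because they are data points assigned to a cluster, not because they were centroids.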

Interpreting cluster variables - raw vs scaled

I have already referred to these posts here and here. I also posted here, but since there was no response, I am posting here as well. Currently, I am working on customer segmentation using purchase data. My data has the info below for each customer. Based on the posts linked above, I see that for clustering we have to scale the variables if they are in different units etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
Category: Data Science

Perform clustering from a similarity matrix

I have a list of songs for each of which I have extracted a feature vector. I calculated a similarity score between each vector and stored this in a similarity matrix. I would like to cluster the songs based on this similarity matrix to attempt to identify clusters or sort of genres. I have used the networkx package to create a force-directed graph from the similarity matrix, using the spring layout. Then I used KMeans clustering on the position of …
Category: Data Science

Kmeans cluster validation when I have labeled test data

I'm trying to implement the unsupervised k-means algorithm for sentiment analysis of the IMDb movie dataset created by Stanford. The steps that I followed are: 1) Load the comments. 2) Apply tokenization and stemming, then use the tf-idf algorithm to create a tf-idf matrix. 3) Use the k-means algorithm to divide the data into 2 clusters. My problem is how to validate the clusters, given that I have labeled test data. I want to check if all the negative examples go in one cluster …
Category: Data Science
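
One possible way to run the check described above, sketched in plain Python: map every cluster to the label that occurs most often inside it, then count how many samples agree with that mapping (often called purity). The cluster ids and labels below are made up for illustration, and `cluster_label_agreement` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def cluster_label_agreement(cluster_ids, true_labels):
    """Map each cluster to its majority label and return (mapping, purity)."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Majority label per cluster.
    mapping = {cluster: counts.most_common(1)[0][0]
               for cluster, counts in counts_by_cluster.items()}

    # Samples whose true label matches the majority label of their cluster.
    agreeing = sum(counts[mapping[cluster]]
                   for cluster, counts in counts_by_cluster.items())
    return mapping, agreeing / len(true_labels)


# Hypothetical k-means output (k = 2) and the matching ground-truth labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = ["neg", "neg", "pos", "pos", "pos", "pos", "neg", "neg"]

mapping, purity = cluster_label_agreement(cluster_ids, true_labels)
print(mapping)  # {0: 'neg', 1: 'pos'}
print(purity)   # 0.75
```

A caveat with this measure: both clusters can end up mapped to the same label when the classes are unbalanced, so it is worth inspecting the mapping as well as the score.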

Confusion regarding k-means clustering for data correlation

I am trying to think through my process before doing any real coding. However, I got confused quickly. Say I have 100 instruments and I know their price movements every day for a year. So I can create a movement matrix

    A = [[I1-1, I2-1, .... I100-1],    (I1-1 is the price for instrument 1 on day 1)
         [I1-2, I2-2, .... I100-2],
         ....
         [I1-365, I2-365, .... I100-365]]

Then for each instrument, I can calculate a price movement correlation between other instruments for …
Category: Data Science

KMeans clustering on documents

I am not able to judge whether this is correct, as I am still in my early days with data science. However, I have applied KMeans to a corpus to which some random documents (very short sentences) have been added. These have been vectorized so as to be suitable. With the clustering results at hand, I was somehow expecting the vectors (keyword) to fall into only one cluster at a time (and no more than that). This is not the case. In some circumstances, …
Category: Data Science

How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?

I have been exploring clustering algorithms (K-Means, K-Medoids, Ward Agglomerative, Gaussian Mixture Modeling, BIRCH, DBSCAN, OPTICS, Common Nearest-Neighbour Clustering) with multidimensional data. I believe that the clusters in my data occur across different subsets of the features rather than occurring across all features, and I believe that this impacts the performance of the clustering algorithms. To illustrate, below is Python code for a simulated dataset:

    ## Simulate a dataset.
    import numpy as np, matplotlib.pyplot as plt
    from sklearn.cluster import KMeans …
Category: Data Science

Semantic segmentation using k-means or mean shift

I know what semantic segmentation is, and I know how to do it using deep learning. My question is: can I do semantic segmentation in a traditional way, such as k-means or mean shift clustering? Here is what I tried:

    import numpy as np
    import cv2
    from sklearn.cluster import MeanShift, estimate_bandwidth
    #from skimage.color import rgb2lab
    #Loading original image
    originImg = cv2.imread('test/2019_00254.jpg')
    # Shape of original image
    originShape = originImg.shape
    # Converting image into array of dimension [nb of …
Category: Data Science

Scaling negative and positive variables when performing a k-means cluster analysis

I'm looking to perform a k-means cluster analysis on a set of data whose variables have ranges containing both positive and negative values. Given that the ranges vary so much, the data will need to be scaled, but my concern is with the variables that contain negative values. Should I perform some sort of log transformation on all the data so as to scale the data to positive values? For example: Variable A: 3.4, 5.6, 1.3, 7.6, 8.3 Variable B: 1, 2, 3, 2, 1 Variable …
Topic: k-means
Category: Data Science
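
As one option for the scaling question above, here is a minimal sketch of z-score standardization in plain Python. It subtracts the mean and divides by the standard deviation, so it works the same way on negative and positive values, whereas a logarithm is undefined for values at or below zero. Variables A and B are the values quoted in the question; variable C is a hypothetical column with negative values added for illustration, since the original example is cut off.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std. dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread to rescale.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


columns = {
    "A": [3.4, 5.6, 1.3, 7.6, 8.3],           # from the question
    "B": [1, 2, 3, 2, 1],                     # from the question
    "C": [-120.0, 35.0, -4.5, 60.0, -88.0],   # hypothetical, mixed signs
}

scaled = {name: standardize(col) for name, col in columns.items()}
for name, col in scaled.items():
    print(name, [round(v, 2) for v in col])
```

After this step every column has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distances that k-means relies on. Whether standardization is the right choice for a particular dataset is still a judgment call.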

Low silhouette coefficient

I am doing k-means clustering on a dataset of selling values of articles. Each article has 52 selling values (one per week). I am trying to automatically calculate the optimal number of clusters for any unknown dataset. I tried two criteria: the elbow method and the silhouette coefficient. With the silhouette coefficient, for 1 to 20 clusters I got values from 0.059 to 0.117, which is (in my opinion) extremely low (I have heard that about 0.7 is normal). For …
Category: Data Science
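
For reference, the silhouette coefficient mentioned above can be sketched in plain Python. For each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the score is (b - a) / max(a, b), averaged over all points. It is only defined for two or more clusters. The toy points and labels are made up for illustration, and Euclidean distance is an assumption.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # Mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, hence len(own) - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Lowest mean distance to the members of any other cluster.
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight groups that are far apart.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Values near 1 mean points sit much closer to their own cluster than to the next one, values near 0 mean the clusters overlap, and negative values suggest points that may be assigned to the wrong cluster.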

PCA huge parts of missing data filling

I’m performing PCA on different time series and then using K-Means clustering to try to group together common factors. The issue I’m facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
Category: Data Science

K-Means R vs K-Means Python different cluster values generating different bar Graphs

Below are 2 sets of code that do the same thing, one in Python and the other in R. They both plot the K-Means result the same way with respect to PCA, but once I draw the bar chart at the end using the cluster centers, the graphs are totally different. I believe there is something wrong with the K-Means and the cluster calculation in Python. The original code was provided in R. I am trying to see why the bar chart in …
Category: Data Science

Choosing attributes for k-means clustering

K-means clustering tries to minimize the within-cluster scatter and maximize the distances between clusters. It does so on all attributes. I am learning about this method on several datasets. To illustrate, in one of the datasets countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of countries. How should I deal with these attributes? As mentioned before, k-means tries to minimize the scatter …
Category: Data Science

Optimal clusters for K-means not clear - any ideas?

I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process it is highly likely there are duplicate people caused by typos and I am trying to cluster them using K-means. I know there are easier ways of doing this, but the reason I am doing it like this is out of curiosity. In order to vectorize each person I am concatenating the strings as follows: [name][address][birthday] and then …
Category: Data Science
