I am working with a data set that, besides customer age and income, gives the balance a customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, a value of 0 means the customer does not hold that account. There are 9800 customer observations, with roughly 6000 checking accounts and 4000 savings accounts; for the other account types there are fewer than 300 observations each. I have to use K-Means Clustering analysis …
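A minimal sketch of how such data could be prepared and clustered, assuming the balances live in a CSV; the file name and exact column labels are placeholders, not from the original post. Standardizing first keeps the large balance columns from dominating the Euclidean distances.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customers.csv")  # hypothetical file name
features = ["Age", "Income", "Checking", "Shares", "Investment",
            "Savings", "Deposit", "Mortgage", "Loan", "Certificates"]

# Income and balances are on very different scales, so standardize per column.
X = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_
print(df.groupby("cluster")[features].mean())  # inspect cluster profiles in original units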
I recently received a manuscript for review in which the author used ~1000 "fake" data points so that the final centroid of K-means stays within the required range. Neither I nor the author has a background in data science, and the paper is more of an application in our research area. I have tried to find published work related to such a method of restricting K-means centers, but failed to do so. However, on simple logic it seems like a valid approach, so …
I am using this dataset; the target column is the last one, 'DEATH_EVENT', which I have separated out. I am using KMeans and then counting the number of hits and misses. The result is quite bad, so I think I should delete some columns, or write a loop that drops them. What would you do?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X = np.genfromtxt('heart_failure_clinical_records_dataset.csv', delimiter=',')
X = np.delete(X, 0, 0)
train, test = train_test_split(X, …
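A minimal sketch of one way to count "hits": map each of the two K-means clusters to the majority DEATH_EVENT label inside it, then score. The column positions are assumptions based on the description (last column = target, first row = header), and scaling is added because the clinical features are on very different scales.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

data = np.genfromtxt('heart_failure_clinical_records_dataset.csv', delimiter=',')
data = np.delete(data, 0, 0)          # drop the header row
X, y = data[:, :-1], data[:, -1]      # separate the DEATH_EVENT column

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Cluster ids are arbitrary, so align each cluster with its majority class.
hits = 0
for c in np.unique(labels):
    mask = labels == c
    majority = np.bincount(y[mask].astype(int)).argmax()
    hits += np.sum(y[mask] == majority)
print(f"hits: {hits}, misses: {len(y) - hits}")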
I have a dataset which contains ~15 features. With the elbow method, I found out that the optimal number of clusters is probably four. Therefore, I applied the K-means algorithm with four clusters. Now, I would like to understand why these clusters have been formed the way they are. In other words, I would like to identify the shared properties of the points of a specific cluster. My idea is the following: Let's pretend that C1 are the coordinates of …
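A minimal sketch of one way to profile the clusters: compare each cluster's mean (per feature) against the overall mean, so large positive or negative entries show which features make a cluster distinctive. The data here is a stand-in generated with make_blobs, not the real 15-feature dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, n_features=15, centers=4, random_state=0)  # stand-in data
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(StandardScaler().fit_transform(X))

df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["cluster"] = labels
profile = df.groupby("cluster").mean() - df.drop(columns="cluster").mean()
print(profile.round(2))  # per-cluster deviation from the global mean, in original units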
When given two random points which are not instances in the dataset, should I include those centroids in my calculations for the new centroids? For example, in this link they use starting centroids that are part of the dataset to calculate the mean for the new centroids. But if given random x and y coordinates, let's say [2,1] and [3,2], which are not part of the dataset, do I also include them or just the instances in the …
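A minimal sketch of a single K-means update step with random starting centroids that are not data points: only the dataset instances assigned to a centroid are averaged, and the old centroid itself is never included in the mean. The toy instances are invented.

import numpy as np

data = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])  # dataset instances
centroids = np.array([[2.0, 1.0], [3.0, 2.0]])                     # the two random points

# Assign each instance to its nearest centroid.
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# New centroid = mean of the assigned instances only.
new_centroids = np.array([data[assignments == k].mean(axis=0)
                          for k in range(len(centroids))])
print(new_centroids)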
I already referred to these posts here and here. I also posted here, but since there has been no response, I am posting here. Currently, I am working on customer segmentation using purchase data, so my data has the below info for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units, etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
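A minimal sketch, with made-up column names, showing that standardization is applied per column: it removes the unit, but within each column the ordering and relative spacing of customers is preserved, so that information is not lost.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "total_spend":   [100.0, 2500.0, 800.0, 60.0],   # dollars
    "num_purchases": [2, 40, 12, 1],                  # counts
})
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled)  # each column now has mean 0 and unit variance, same ordering as before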
I have a list of songs, for each of which I have extracted a feature vector. I calculated a similarity score between each pair of vectors and stored this in a similarity matrix. I would like to cluster the songs based on this similarity matrix to try to identify clusters, or a sort of genre grouping. I have used the networkx package to create a force-directed graph from the similarity matrix, using the spring layout. Then I used KMeans clustering on the position of …
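A minimal sketch of the pipeline described above (force-directed layout from a similarity matrix, then K-means on the node positions); the similarity matrix here is random stand-in data, not the real song features.

import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sim = rng.random((30, 30))
sim = (sim + sim.T) / 2          # make it symmetric
np.fill_diagonal(sim, 1.0)

G = nx.from_numpy_array(sim)               # edge weights = similarities
pos = nx.spring_layout(G, weight="weight", seed=0)
coords = np.array([pos[i] for i in G.nodes()])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
print(labels)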
I'm trying to implement the unsupervised k-means algorithm for sentiment analysis of the IMDB movie dataset created by Stanford. The steps I followed are: 1) load the comments; 2) apply tokenization and stemming, and use TF-IDF to create the TF-IDF matrix; 3) use k-means to divide the data into 2 clusters. My problem is how to validate the clusters; I have labeled test data. I want to check if all the negative examples go in one cluster …
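A minimal sketch of validating two K-means clusters against labeled data: since cluster ids are arbitrary, try both id-to-sentiment mappings, or use a label-permutation-invariant score such as NMI. The documents and labels below are toy placeholders, not the IMDB data.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

docs = ["great movie, loved it", "terrible film, waste of time",
        "wonderful acting", "awful plot and boring"]
y_true = np.array([1, 0, 1, 0])   # 1 = positive, 0 = negative (toy labels)

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Try both possible cluster-to-label mappings and keep the better one.
acc = max(np.mean(clusters == y_true), np.mean((1 - clusters) == y_true))
print("best-mapping accuracy:", acc)
print("NMI:", normalized_mutual_info_score(y_true, clusters))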
I am trying to think through my process before doing any real coding; however, I got confused quickly. Say I have 100 instruments and I know their price movements every day for a year, so I can create a movement matrix

A = [[I1-1,   I2-1,   ....,  I100-1],     (I1-1 is the price for instrument 1 on day 1)
     [I1-2,   I2-2,   ....,  I100-2],
     ....
     [I1-365, I2-365, ....,  I100-365]]

Then for each instrument, I can calculate a price movement correlation with the other instruments for …
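A minimal sketch of that step: build the (days x instruments) movement matrix, compute the instrument-by-instrument correlation matrix, and turn it into a distance. Because K-means itself expects feature vectors rather than a distance matrix, the clustering step below swaps in hierarchical clustering on the precomputed distances as one option; the data is random stand-in data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
A = rng.normal(size=(365, 100))          # rows = days, columns = instruments

corr = np.corrcoef(A, rowvar=False)      # 100 x 100 correlation matrix
dist = 1.0 - corr                        # higher correlation -> smaller distance
np.fill_diagonal(dist, 0.0)

# Hierarchical clustering accepts a precomputed condensed distance matrix.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels))               # cluster sizes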
I see that the authors of this paper report F1 and NMI scores to measure clustering quality. However, I don't understand the algorithm by which they actually compute them. See the Evaluation section. Although I have looked at the code, I am not sure about the actual algorithm.
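A hedged sketch of two common conventions for these metrics; I cannot confirm this matches that paper's exact procedure. NMI comes straight from sklearn, and one widespread pairwise F1 treats every pair of points placed in the same cluster (or same ground-truth class) as a predicted (or true) positive pair.

import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # toy ground-truth classes
y_pred = np.array([1, 1, 0, 0, 0, 2])   # cluster ids are arbitrary

print("NMI:", normalized_mutual_info_score(y_true, y_pred))

pairs = list(combinations(range(len(y_true)), 2))
same_true = [int(y_true[i] == y_true[j]) for i, j in pairs]
same_pred = [int(y_pred[i] == y_pred[j]) for i, j in pairs]
print("pairwise F1:", f1_score(same_true, same_pred))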
Whether this is correct or not I am not able to judge, being myself in the early days of data science. However, I have applied K-means to a corpus to which some random documents (very short sentences) were added. These have been vectorized so as to be suitable. With the clustering results at hand, I was somehow expecting each vector (keyword) to fall into only one cluster at a time (and no more than that). This is not the case. In some circumstances, …
I have been exploring clustering algorithms (K-Means, K-Medoids, Ward Agglomerative, Gaussian Mixture Modeling, BIRCH, DBSCAN, OPTICS, Common Nearest-Neighbour Clustering) with multidimensional data. I believe that the clusters in my data occur across different subsets of the features rather than across all features, and I believe that this impacts the performance of the clustering algorithms. To illustrate, below is Python code for a simulated dataset:

## Simulate a dataset.
import numpy as np, matplotlib.pyplot as plt
from sklearn.cluster import KMeans …
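A minimal, self-contained sketch (not the simulation from the post) of the effect being described: clusters that are only separated in a subset of the features, plus noisy irrelevant features, degrade K-means unless only the informative subset is used.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 300
labels_true = rng.integers(0, 3, size=n)
informative = labels_true[:, None] * 5 + rng.normal(size=(n, 2))  # 2 informative features
noise = rng.normal(scale=10, size=(n, 8))                          # 8 irrelevant features
X = np.hstack([informative, noise])

for name, data in [("all 10 features", X), ("2 informative features", X[:, :2])]:
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    print(name, "ARI:", round(adjusted_rand_score(labels_true, pred), 3))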
I know what semantic segmentation is, and I know how to do semantic segmentation using deep learning, but my question here is: can I do semantic segmentation in a traditional way, such as with k-means or mean shift clustering? Here's what I tried to do:

import numpy as np
import cv2
from sklearn.cluster import MeanShift, estimate_bandwidth
#from skimage.color import rgb2lab

#Loading original image
originImg = cv2.imread('test/2019_00254.jpg')

# Shape of original image
originShape = originImg.shape

# Converting image into array of dimension [nb of …
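A minimal sketch of how that pipeline can be completed, assuming the same image path as the post: cluster the pixel colours with MeanShift and reshape the labels back to the image shape. This produces colour-based segments rather than true semantic classes, which is the usual limitation of the traditional approach.

import numpy as np
import cv2
from sklearn.cluster import MeanShift, estimate_bandwidth

img = cv2.imread('test/2019_00254.jpg')          # path taken from the post
flat = img.reshape((-1, 3)).astype(np.float32)   # rows = pixels, columns = B, G, R

bandwidth = estimate_bandwidth(flat, quantile=0.1, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(flat)

segmented = ms.labels_.reshape(img.shape[:2])    # one cluster id per pixel
print("number of segments:", ms.labels_.max() + 1)
cv2.imwrite('segmented.png', (segmented * (255 // max(1, segmented.max()))).astype(np.uint8))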
I'm looking to perform a k-means cluster analysis on a set of data whose variables span ranges with both positive and negative values. Given the ranges vary so much, the data will need to be scaled, but my concern is with the variables that contain negative values. Should I perform some sort of log transformation on all the data so as to shift it to positive values? For example: Variable A: 3.4, 5.6, 1.3, 7.6, 8.3; Variable B: 1, 2, 3, 2, 1; Variable …
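A minimal sketch showing that z-score standardization handles negative values directly, without any log transform (which would fail on non-positive entries). The first two columns reuse the example values above; the third, signed wide-range column is invented.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [3.4, 1.0, -250.0],
    [5.6, 2.0,  -30.0],
    [1.3, 3.0,  400.0],
    [7.6, 2.0,  120.0],
    [8.3, 1.0,  -75.0],
])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~0 for every column
print(X_scaled.std(axis=0).round(6))   # ~1 for every column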
I am doing k-means clustering on a dataset of selling values of articles; each article has 52 selling values (one per week). I am trying to automatically calculate the optimal number of clusters for any unknown dataset. I tried two criteria: the elbow method and the silhouette coefficient. For the silhouette coefficient, across 1 to 20 clusters I got values from 0.059 to 0.117, which is (in my opinion) extremely low (I have heard that around 0.7 is normal). For …
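A minimal sketch of scanning k with the silhouette coefficient (note it is only defined for k >= 2); the 52-week sales matrix here is random stand-in data, not the real article data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 52))   # rows = articles, columns = weekly sales

scores = {}
for k in range(2, 21):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k:", best_k, "silhouette:", round(scores[best_k], 3))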
I'm performing PCA on different time series and then using K-means clustering to try to group together common factors. The issue I'm facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
Below are two sets of code that do the same thing, one in Python and the other in R. They both graph the K-means result the same way with respect to PCA, but once I make the bar chart at the end using the cluster centers, the graphs are totally different. I believe there is something wrong with the K-means and the cluster-center calculation in Python. The original code was provided in R. I am trying to see why the bar chart in …
K-means clustering tries to minimize the within-cluster scatter and maximize the distances between clusters. It does so on all attributes. I am learning about this method on several datasets. To illustrate, in one of the datasets countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of countries. How should I deal with these attributes? As mentioned before, k-means tries to minimize the scatter …
I am looking for an unsupervised method that can also detect the points that start to look different from the majority. Which clustering techniques (I use Python) can be used for such data sets? I have tried k-means, but as I expected it fails considerably at detecting such peaks.
I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process, it is highly likely there are duplicate people caused by typos, and I am trying to cluster them using K-means. I know there are easier ways of doing this, but the reason I am doing it like this is out of curiosity. In order to vectorize each person, I am concatenating the strings as follows: [name][address][birthday] and then …
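A minimal sketch of one way to vectorize the concatenated [name][address][birthday] strings: character n-gram TF-IDF followed by K-means, so typo variants end up with very similar vectors. The records below are invented examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

records = [
    "John Smith12 Oak Street1990-04-01",
    "Jon Smith12 Oak Stret1990-04-01",     # typo variant of the same person
    "Mary Jones99 Elm Avenue1985-11-23",
]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(records)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the two near-duplicates should usually land in the same cluster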