Customer Segmentation: Should I use a variable, representing a product, that is unpopular in the dataset for K-Means Clustering?

I am working with a data set that, besides customer age and income, gives the balance a customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, 0 means that the customer does not have that account. There are 9800 customer observations, with roughly 6000 checking accounts and 4000 savings accounts. For the others, there are fewer than 300 observations. I have to use K-Means Clustering analysis …
Category: Data Science

Clustering time series data using dynamic time warping

I would like to cluster/group the curves in the attached picture with Python. The data is already normalized and my approach would be to use dtw (dynamic time warping) to calculate the distance and with that feature use a clustering algorithm (like kmeans or DBSCAN) to classify them. Do I pick out one trajectory as a starting curve to compare the other curves to, or do I calculate an 'average' curve of all curves and use that as the starting …
Category: Data Science
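
One way to avoid picking a reference curve at all is to compute the DTW distance between every pair of curves and hand the resulting matrix to an algorithm that accepts precomputed distances (DBSCAN or hierarchical clustering, for instance; standard k-means works on feature vectors rather than a pairwise matrix). The following is a minimal pure-Python sketch of that idea. The toy curves and the helper names `dtw_distance` and `pairwise_dtw` are assumptions made for illustration, not part of the original question.

```python
from math import inf


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance for 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] matched again with the same b element
                cost[i][j - 1],      # b[j-1] matched again with the same a element
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def pairwise_dtw(curves):
    """Symmetric matrix of DTW distances between all pairs of curves."""
    k = len(curves)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(curves[i], curves[j])
    return dist


# Hypothetical toy curves (not the data from the question)
curves = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0, 0.0],  # same shape, shifted in time
    [2.0, 2.0, 2.0, 2.0, 2.0],       # flat line
]
D = pairwise_dtw(curves)
```

Here the first two curves end up at distance 0 because DTW absorbs the time shift, while the flat curve stays far from both. For many long curves, a dedicated DTW library would likely be faster than this quadratic loop.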

Inference from text data without label or Target

I have a use case where I have text data entered by an approver while approving a loan. I have to infer the likely reasons for approval using NLP. How should I go about it? The text is in a non-English language. Can clustering the text help? Is it possible to cluster non-English text using Python libraries?
Category: Data Science

Retrieve image from features represented by histograms of oriented gradients

I am using histograms of oriented gradients for image classification using clustering in scikit-learn. I am using hog from scikit-image to generate HOG features from a 512x512 grayscale image. Here is an example: fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1), visualize=True, channel_axis=-1) where fd is used as the features in classification. I wonder if there is a way to retrieve an image from the fitted coefficients in the clustering model, in order to see how features differ between the clusters (i.e., go from fd …
Category: Data Science

Dendrogram: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I am trying to plot a Dendrogram to cluster data but this error is stopping me. My data is here. I first chose columns to work with: df_euro = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv') samples = df_euro.iloc[:, 2:7].values[:42] country_names = df_euro.iloc[:, 1].values[:42] # Calculate the linkage: mergings mergings = linkage(samples , method = 'complete') # Plot the dendrogram dendrogram( mergings, labels = y, leaf_rotation = 90, leaf_font_size = 6 ) plt.show() But I'm getting this error which I can't understand. I googled it and …
Category: Data Science

How to decide who to market? Clustering or Decision Tree?

I am working with a dataset that has enough observations and ~10 variables: half of the variables are numeric; the other half are categorical with 2-3 levels (demographics); one is an ID variable; and the last one holds the sales value, 0 for no sale and the bill amount for a sale. Using this information, I want to understand which segments of my customers to market to. I am using R for code but that's not relevant here. :) I am confused about …
Category: Data Science

Conditional clustering

I have a dataset consisting of addresses (points) that have several attributes: one that distinguishes the "sort" of address and one that contains a numerical value. I want to cluster these points based on their distance from each other and the sort of address. However, the summed numerical attribute per cluster cannot exceed a certain threshold value. In other words, the system needs to form clusters but needs to stop clustering as soon as the sum of the numerical value …
Category: Data Science

Cluster words into groups of similar meaning (synonyms)

How can words be clustered into groups of similar meaning (synonyms)? I started with pre-trained word embeddings (e.g., Google News), which is great, but not perfect - a limitation arises because the word embeddings are based on surrounding words. This introduces challenging results. For example: polar meanings: word embeddings might find opposites to be similar. Even though these words mean the opposite semantically, they can quite readily be interchanged given the same preceding and following words. For example, "terrible" and …
Category: Data Science

Find the shared properties of cluster samples

I have a dataset which contains ~15 features. With the elbow method, I found out that the optimal number of clusters is probably four. Therefore, I applied the K-means algorithm with four clusters. Now, I would like to understand why these clusters have been formed the way they are. In other words, I would like to identify the shared properties of the points of a specific cluster. My idea is the following: Let's pretend that C1 are the coordinates of …
Category: Data Science
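
One common way to see what the members of a cluster share is to compare each cluster's feature means with the global means, expressed in global standard deviations; features with large absolute values are the ones that characterise the cluster. The sketch below assumes the cluster labels are already available. The toy rows, the feature names and the helper `cluster_profile` are hypothetical and only illustrate the idea.

```python
from statistics import mean, pstdev


def cluster_profile(rows, labels, feature_names):
    """Per cluster: (cluster mean - global mean) / global std for each feature."""
    columns = list(zip(*rows))
    global_mean = [mean(c) for c in columns]
    # Guard against constant features, whose standard deviation is 0
    global_std = [pstdev(c) or 1.0 for c in columns]

    profile = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        member_columns = zip(*members)
        profile[cluster] = {
            name: (mean(col) - gm) / gs
            for name, col, gm, gs in zip(
                feature_names, member_columns, global_mean, global_std
            )
        }
    return profile


# Hypothetical toy data: two features, two clusters
rows = [(20.0, 50.0), (22.0, 52.0), (60.0, 51.0), (62.0, 49.0)]
labels = [0, 0, 1, 1]
profile = cluster_profile(rows, labels, ["age", "income"])

for cluster, scores in profile.items():
    print(cluster, {name: round(z, 2) for name, z in scores.items()})
```

In this toy case the clusters differ strongly in "age" and much less in "income", which suggests that age is the property the members of each cluster have in common.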

How would you describe cluster 2 from this output of a run of the EM program?

My description: Cluster 2 consists of 9511 instances, and the age is around 42 (ranging between 29.7207 and 54.5257). Considering Age, Cluster 2 is very well separated from Cluster 1, with a distance of 18.9513. On the other hand, Cluster 2 and Cluster 0 are very close: their centroids are within a distance of around 0.8248. What else could be added?
Category: Data Science

Calculating new centroids when the centroids are chosen at random

When given two random points which are not instances in the dataset, should I include the centroids in my calculations for the new centroids? For example, in this link they are using starting centroids which are a part of the dataset to calculate the mean for the new centroids. But if given random x and y coordinates, let's say [2,1] and [3,2], which are not a part of the dataset, do I also include them or just the instances in the …
Category: Data Science
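
In the standard k-means update step (Lloyd's algorithm), each new centroid is the mean of the data points currently assigned to that cluster; the previous centroid itself is not averaged in. The sketch below runs one assignment and one update step starting from the points [2,1] and [3,2] mentioned in the question. The four data points are made up for illustration.

```python
def assign(points, centroids):
    """Index of the nearest centroid for each point (squared Euclidean distance)."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """New centroid = mean of the assigned data points only."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # Empty cluster: keeping the old centroid is one common convention
            new_centroids.append(list(old))
            continue
        new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
    return new_centroids


# Hypothetical data points; the starting centroids are not among them
points = [[1.0, 1.0], [1.0, 2.0], [4.0, 3.0], [5.0, 4.0]]
centroids = [[2.0, 1.0], [3.0, 2.0]]

labels = assign(points, centroids)
new_centroids = update(points, labels, centroids)
print(labels)         # [0, 0, 1, 1]
print(new_centroids)  # [[1.0, 1.5], [4.5, 3.5]]
```

The starting centroids only decide which points land in which cluster; after the first update they no longer appear in any calculation.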

Interpreting cluster variables - raw vs scaled

I have already referred to these posts here and here. I also posted here, but since there was no response, I am posting here as well. Currently, I am working on customer segmentation using purchase data. My data has the info below for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units, etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
Category: Data Science
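
Standardisation is an invertible transformation, so the raw values are not lost: a typical workflow is to cluster on the scaled variables and then map the centroids (or simply the cluster means) back to the original units for interpretation. The following is a minimal sketch under that assumption. The two columns and the cluster membership are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column; also return the means and stds needed to undo it."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def to_raw_units(scaled_point, means, stds):
    """Invert the z-score transformation for one point (e.g. a centroid)."""
    return [v * s + m for v, m, s in zip(scaled_point, means, stds)]


# Hypothetical columns: number of purchases, total spend
rows = [[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]]
scaled, means, stds = standardize(rows)

# Suppose a clustering on the scaled data put the first two customers together
centroid_scaled = [mean(c) for c in zip(*scaled[:2])]
centroid_raw = to_raw_units(centroid_scaled, means, stds)
print(centroid_raw)  # approximately [3.0, 200.0]
```

The scaled centroid is what the algorithm works with; the raw centroid is what gets reported when describing the segment.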

How do you extract separate structures from a cluster of points in 2D coordinates

I have a bunch of points in x,y that correspond to some physical processes. My goal is to extract and group points based on the event/process they correspond to. The image attached shows an example of what the data looks like. By inspection you can clearly make out at least 2 curves that correspond to the processes I want. The data itself has a lot of noise and some false positive events. I have already played around with DBSCAN and it doesn't quite …
Topic: clustering
Category: Data Science

Assigning a new document to a cluster based on keywords extracted and tf-idf

I have about 40 clusters of documents defined by a combination of k-means clustering algorithm and hand curation. For example, some of the clusters given by k-means are too noisy so they have been further subdivided. Now I want to assign new documents to these clusters. I found that it is possible to extract keywords using tf-idf based methods as mentioned here. My approach is to extract key terms from each of these clusters using tf-idf based method and I …
Category: Data Science

Clustering with hierarchical data dependencies

I am currently looking into how to cluster data with hierarchical dependencies. An example of a problem that I want to cluster: we would like to cluster cities to identify similar characteristics with respect to inhabitants. As input data, I have some characteristics such as the age, weight, height and sex of the inhabitants. Each city will therefore be modeled by a vector: x_1 = number of people aged 20 years old, number of people …
Category: Data Science

Clustering 2D curves

I have a set of curves in 2D space each expressed as a set of (sampled) data points. Each set has more or less the same number of items - eventually I guess I’ll use binning to make sure the number of points is the same (say 50) if that can help. I would like to cluster the curves in N groups. Computing N should be part of the solution. Possible translations on the first dimension are irrelevant. I have …
Category: Data Science

Perform clustering from a similarity matrix

I have a list of songs for each of which I have extracted a feature vector. I calculated a similarity score between each vector and stored this in a similarity matrix. I would like to cluster the songs based on this similarity matrix to attempt to identify clusters or sort of genres. I have used the networkx package to create a force-directed graph from the similarity matrix, using the spring layout. Then I used KMeans clustering on the position of …
Category: Data Science

Why Do a Set of 3 Clusters Across 1 Dimension and a Set of 3 Clusters Across 2 Dimensions Form 9 Apparent Clusters in 3 Dimensions?

I am sorry if this is a well-known phenomenon but I can't quite wrap my head around this. I have a related question: How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?. There are good answers for feature selection and cluster metrics but I think this phenomenon deserves special attention. I have simulated 3 clusters along 1 dimension, and then simulated 3 clusters along 2 dimensions, and then combined them into a dataset …
Category: Data Science
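
A likely explanation, assuming the two simulated groupings were assigned independently of each other, is combinatorial: each point takes one of 3 centers in the first dimension and one of 3 centers in the other two dimensions, so the combined 3-D data has up to 3 × 3 = 9 distinct centers. The sketch below only enumerates those combinations. The center coordinates are hypothetical.

```python
from itertools import product

# Hypothetical cluster centers (not taken from the question)
centers_1d = [-5.0, 0.0, 5.0]                         # 3 clusters along dimension 1
centers_2d = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # 3 clusters in dimensions 2-3

# Independent assignment: every 1-D center can co-occur with every 2-D center
combined_centers = [(x, y, z) for x, (y, z) in product(centers_1d, centers_2d)]

for center in combined_centers:
    print(center)
print(len(combined_centers))  # 9
```

If the two memberships were perfectly correlated instead (the same label driving all three dimensions), only 3 of these 9 combinations would be populated.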

Grouping/clustering similar words python

I have a question regarding grouping of similar words. For example, I have the list of words given below: artificialintelligence, Artificial Intelligence, AI, Machine Learning, ML, Data Analytics, Data & Analytics. I want to group these words into [Artificial intelligence, machine Learning, Data Analytics]. I have used difflib.get_close_matches() but that does not give me the desired result. For example, this is how difflib groups: 'Information Technology': ['Information Technology','Mobile Technology', 'newtechnology']. I have also used fuzz.token_set_ratio() but that also does not provide me …
Category: Data Science
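
For the specific variants listed in the question (missing spaces, different casing, "&", and acronyms), a rule-based pass may be enough before reaching for fuzzy matching: normalise each term to lowercase alphanumerics, and fold a term into a multi-word term whose initials it matches. This is only a sketch under those assumptions. The helpers `normalize`, `initials` and `group_terms` are hypothetical names, and the approach does not handle typos or true synonyms.

```python
import re
from collections import defaultdict


def normalize(term):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def words_of(term):
    return re.findall(r"[A-Za-z0-9]+", term)


def initials(term):
    """First letter of each word, lowercased ('Machine Learning' -> 'ml')."""
    return "".join(w[0] for w in words_of(term)).lower()


def group_terms(terms):
    # Pass 1: group spelling variants that share a normalised key
    groups = defaultdict(list)
    for term in terms:
        groups[normalize(term)].append(term)

    # Pass 2: map the initials of multi-word terms to that term's key
    acronyms = {}
    for key, members in groups.items():
        for member in members:
            if len(words_of(member)) > 1:
                acronyms.setdefault(initials(member), key)

    # Pass 3: fold acronym groups into the group they abbreviate
    merged = defaultdict(list)
    for key, members in groups.items():
        merged[acronyms.get(key, key)].extend(members)
    return dict(merged)


terms = [
    "artificialintelligence", "Artificial Intelligence", "AI",
    "Machine Learning", "ML", "Data Analytics", "Data & Analytics",
]
groups = group_terms(terms)
for key, members in groups.items():
    print(key, "->", members)
```

On this list the sketch yields three groups, keyed by "artificialintelligence", "machinelearning" and "dataanalytics". Acronyms shared by several phrases would need an extra disambiguation rule.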

How to divide earth into polygons based on a collection of labeled coordinates?

I have around one million labeled coordinates (latitude, longitude) all around the world, with around 10,000 unique labels (location_id). Each point corresponds to exactly one class (location_id). Each class is densely distributed over a 1-10 km radius, with more density around its centroid. How can I create an earth multi-polygon consisting of 10,000 polygons, basically dividing the earth into 10,000 polygons? The separation would be based on the density of points in each location. The more points clumped in a location, the bigger its polygon's …
Category: Data Science
