What clustering algorithm is best for a dataset with only binary categorical features?

I have a dataset with many binary categorical features and a single continuous target value. I would like to cluster the observations, but I am not quite sure what to use. In the past I used DBSCAN for something similar and it worked well, but that dataset also had many continuous features. Do you have any tips or suggestions? Would you suggest matrix factorization followed by clustering? Thank you very much!
Category: Data Science
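One common approach for purely binary features is to cluster on a mismatch-based distance such as Hamming (k-modes is another option). A minimal sketch using hierarchical clustering on precomputed Hamming distances; the feature matrix here is a made-up illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical binary feature matrix (rows = observations).
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
])

# Hamming distance = fraction of mismatching features, a natural
# dissimilarity for binary attributes.
d = pdist(X, metric="hamming")

# Average-linkage hierarchical clustering on the precomputed distances,
# cut into 2 flat clusters.
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Because the distances are precomputed, the same pipeline accepts any binary-appropriate metric (e.g. `metric="jaccard"`) without changing the clustering step.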

Ways of calculating the area of colored regions in a map

Background: I am a PhD student trying to improve my data science skills. One of my research projects has me tasked with determining the size of the clusters in a colored image of regions. Here is an example image I am using. The coloring is natural, as it represents the orientation of the microscope light: the light hits the surface in different ways, resulting in the different colors. But I'm not trying to sum regions of similar colors, but instead just …
Category: Data Science
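If the image can first be quantized so that each pixel carries a color/region id, then measuring region sizes reduces to connected-component labeling. A minimal sketch with `scipy.ndimage` on a hypothetical pre-segmented array standing in for the real image:

```python
import numpy as np
from scipy import ndimage

# Hypothetical pre-segmented image: each pixel holds a color id.
img = np.array([
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [3, 3, 0, 2],
])

# For each color id, label its connected components (4-connectivity by
# default) and count the pixels in each component = region area.
areas = {}
for color in np.unique(img):
    mask = img == color
    labeled, n = ndimage.label(mask)
    areas[color] = ndimage.sum(mask, labeled, index=range(1, n + 1)).tolist()
print(areas)
```

Multiplying pixel counts by the physical area per pixel (from the microscope's calibration) would convert these into real units.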

What are the benefits of using spectral k-means over simple k-means?

I have understood why k-means can get stuck in local minima. Now I am curious to know how spectral k-means helps to avoid this local-minimum problem. According to the paper A tutorial on Spectral, the spectral algorithm goes in the following way:
- Project the data into an $R^n$ matrix
- Define an affinity matrix $A$, using a Gaussian kernel $K$ or an adjacency matrix
- Construct the graph Laplacian from $A$ (i.e. decide on a normalization)
- Solve the eigenvalue problem
- Select …
Category: Data Science
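The steps listed above can be sketched end to end. The data, the Gaussian kernel width, and the choice of the symmetric normalized Laplacian below are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated toy blobs (hypothetical data).
X = np.vstack([rng.normal(0, 0.1, (10, 2)),
               rng.normal(3, 0.1, (10, 2))])

# 1. Affinity matrix with a Gaussian kernel (sigma = 0.5 assumed).
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
A = np.exp(-sq / (2 * 0.5 ** 2))
np.fill_diagonal(A, 0)

# 2. Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L = np.eye(len(X)) - D_inv_sqrt @ A @ D_inv_sqrt

# 3. Eigenvectors of the k smallest eigenvalues form the embedding;
#    rows are normalized to unit length (Ng-Jordan-Weiss style).
vals, vecs = eigh(L)
emb = vecs[:, :2]
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# 4. Plain k-means in the embedding space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)
```

Note that k-means can still hit local minima in the embedding space; the point is that the embedding tends to make the clusters nearly linearly separable, so those minima matter much less.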

kMeans on graph Laplacian to cluster nodes based on their distance

I have a connected weighted graph and I want to use kMeans to cluster the points based on their distance (smaller distances indicate that the nodes are more likely to be in the same cluster). I computed the Laplacian of the graph and chose the eigenvectors that have corresponding nonzero eigenvalues. I then performed the kMeans in this embedding space represented by the chosen eigenvectors. In the following example, I want to create 2 clusters and I chose nodes $1$ …
Category: Data Science
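A minimal sketch of this procedure on a toy weighted graph (two triangles joined by one weak edge; the weights are made up for illustration, and here they encode similarity, so small distances should first be converted to large weights):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

# Hypothetical weighted adjacency: two triangles (nodes 0-2 and 3-5)
# joined by a single weak edge between nodes 2 and 3.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 5), (1, 2, 5), (0, 2, 5),
                (3, 4, 5), (4, 5, 5), (3, 5, 5),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

# Unnormalized graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvalues in ascending order; skip the trivial constant
# eigenvector (eigenvalue 0 for a connected graph) and embed each node
# with the next eigenvector(s).
vals, vecs = eigh(L)
emb = vecs[:, 1:2]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)
```

For 2 clusters the second eigenvector (the Fiedler vector) already separates the two triangles by sign; k-means on it just formalizes the split.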

How to use spectral clustering to predict?

In an academic paper, the authors talk about using a nearest-neighbour algorithm to predict the cluster of a new point, with the number of nearest neighbours set to 10 in their example. What do they mean by this? The two interpretations I could think of were: (1) look at which 10 training points (neighbours) are closest and assign the new point to the cluster from which the majority of them come; (2) collect one by one …
Category: Data Science
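The first interpretation in the question can be sketched directly: cluster the training set once, then fit a 10-nearest-neighbour classifier on the resulting labels. The data and model settings below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two hypothetical training blobs.
X_train = np.vstack([rng.normal(0, 0.3, (30, 2)),
                     rng.normal(4, 0.3, (30, 2))])

# Spectral clustering has no predict() for unseen points, so cluster
# the training set once and keep the labels.
labels = SpectralClustering(n_clusters=2, random_state=0).fit_predict(X_train)

# Out-of-sample assignment: majority vote among the 10 nearest
# training points (the question's first interpretation).
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, labels)
new_point = np.array([[4.1, 3.9]])
pred = knn.predict(new_point)
print(pred)
```

This treats the cluster labels as pseudo-class labels, which is a common way to extend any transductive clustering method to new data.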

Is subspace clustering better than spectral clustering?

After reading a few papers about subspace clustering (e.g. the one by Elhamifar and Vidal), it looks like subspace clustering includes the scenario of applying spectral clustering: it works for data distributed in a union of subspaces, while spectral clustering works for just one subspace; also, some subspace clustering algorithms (e.g. the linked reference) use spectral clustering as one step in their algorithm. So I wonder whether that means subspace clustering always performs better than spectral clustering? If those subspaces …
Category: Data Science

Why is the k-means cluster breakup like this?

I have a galaxy spectrum data set (22,000 in total), similar to electronic wave data: two-dimensional (flux vs. wavelength). A typical wavelength plot looks like the one below. Now I am running k-means on this data set to cluster the spectra based on their shape/pattern only (using scikit-learn). Some results of the k-means clustering are baffling me; I have made a flow chart of how the candidates clustered as I go on increasing the number of …
Category: Data Science

Clusterize Spectrum

I have a pandas table which contains data about different observations, each one measured at different wavelengths. These observations differ from each other in the treatment they received. The table looks something like this:

>>> name  treatment  410.1  423.2  445.6  477.1  485.2  ....
0     A1          0   0.01   0.02   0.04   0.05   0.87
1     A2          1   0.04   0.05   0.05   0.06   0.04
2     A3          2   0.03   0.02   0.03   0.01   0.03
3     A4          0   0.02   0.02   0.04   0.05   0.91
4     A5          1   0.05 …
Category: Data Science
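A minimal sketch of clustering such a table row-wise with k-means, dropping the identifier columns first; the values and column names below are made-up stand-ins for the question's table:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical table in the same shape as the question's example.
df = pd.DataFrame({
    "name": ["A1", "A2", "A3", "A4"],
    "treatment": [0, 1, 2, 0],
    "410.1": [0.01, 0.04, 0.03, 0.02],
    "423.2": [0.02, 0.05, 0.02, 0.02],
    "445.6": [0.04, 0.05, 0.03, 0.04],
})

# Cluster on the wavelength columns only: each row is one spectrum.
spectra = df.drop(columns=["name", "treatment"]).to_numpy()
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(spectra)
print(df[["name", "treatment", "cluster"]])
```

Keeping `treatment` out of the features but next to the cluster labels makes it easy to check afterwards whether clusters line up with treatments (e.g. via `pd.crosstab`).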

What are practical differences between kernel k-means and spectral clustering?

I've lately been wondering about the kernel k-means and spectral clustering algorithms and their differences. I know that spectral clustering is a broader term and different settings can affect the way it works, but one popular variant applies k-means clustering to the spectral embedding of the affinity matrix. Kernel k-means, on the other hand, applies k-means clustering directly to the affinity matrix. Therefore one immediate, theoretical difference is that it omits the spectral embedding step, i.e. it doesn't look for the lower-dimensional …
Category: Data Science
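The contrast can be made concrete with a small kernel k-means implementation that operates only on a precomputed kernel matrix and never forms an explicit embedding; the `kernel_kmeans` helper and the toy data below are assumptions for illustration:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=20, seed=0):
    """Lloyd-style kernel k-means on a precomputed kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, n)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2 in feature space, dropping the K_ii
            # term, which is constant across clusters for each point.
            dist[:, c] = (-2 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# Toy data: two blobs, RBF kernel (hypothetical settings).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (15, 2)),
               rng.normal(3, 0.2, (15, 2))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq)
labels = kernel_kmeans(K, 2)
print(labels)
```

Note that this iterates Lloyd updates in the implicit feature space rather than literally running k-means on the affinity matrix's rows, so it keeps the local-minimum sensitivity of k-means, whereas the spectral variant's eigendecomposition step is what tends to smooth that problem away.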

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.