I have a dataset with many binary categorical features and a single continuous target value. I would like to cluster the observations, but I am not quite sure what to use. In the past I used DBSCAN for something similar and it worked well, but that dataset also had many continuous features. Do you have any tips or suggestions? Would you suggest matrix factorization followed by clustering? Thank you very much!
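One option worth considering for purely binary features (a sketch of my own, not something from the question) is hierarchical clustering on Jaccard distances, which ignore joint absences and often suit 0/1 data better than Euclidean distance. The data below are a random stand-in for the real dataset:

```python
# Hypothetical sketch: hierarchical clustering of binary observations with
# Jaccard distance, via scipy. All names and sizes here are illustrative.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: 20 observations x 8 binary features (stand-in for the real data).
X = rng.integers(0, 2, size=(20, 8)).astype(bool)
X[:, 0] = True  # guard: avoid all-zero rows, where Jaccard is undefined

# Jaccard distance counts mismatches only among positions where at least
# one observation has a 1, so shared absences do not make points "similar".
D = pdist(X, metric="jaccard")
Z = linkage(D, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")  # ask for at most 3 clusters
print(labels)
```

Whether this beats matrix factorization followed by k-means depends on how meaningful co-absence is in your features; Jaccard deliberately discards it.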
Background: I am a PhD student trying to improve my data science skills. One of my research projects has me tasked with determining the size of the clusters in a colored image of regions. Here is an example image I am using. The coloring is natural, as it represents the orientation of the microscope light: the light hits the surface in different ways, resulting in the different colors. I'm not trying to sum regions of similar colors, but instead just …
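If the goal is the pixel area of each contiguous same-color region (my reading of the question, which is cut off), a minimal sketch with `scipy.ndimage` on a toy image might look like this; a real RGB image would first need its colors quantized to integer codes, e.g. by rounding or k-means on the pixel colors:

```python
# Hedged sketch on a made-up 4x4 "image" of integer color codes, not the
# questioner's actual microscope image.
import numpy as np
from scipy import ndimage

img = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [3, 3, 2, 2],
    [3, 3, 2, 2],
])

sizes = {}
for color in np.unique(img):
    # Label 4-connected components of this color only.
    labeled, n = ndimage.label(img == color)
    for region_id in range(1, n + 1):
        sizes[(int(color), region_id)] = int((labeled == region_id).sum())

print(sizes)  # (color, region) -> area in pixels
```

Two regions of the same color that do not touch get separate labels, which is the difference between region sizes and a per-color pixel sum.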
I have understood why k-means can get stuck in local minima. Now I am curious how spectral k-means helps to avoid this local-minima problem. According to the paper "A tutorial on Spectral", the spectral algorithm goes as follows: 1) project the data into $R^n$; 2) define an affinity matrix $A$ using a Gaussian kernel $K$ or an adjacency matrix; 3) construct the graph Laplacian from $A$ (i.e. decide on a normalization); 4) solve the eigenvalue problem; 5) select …
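The steps listed above can be sketched in a few lines; this is a minimal version under my own choices (Gaussian kernel, unnormalized Laplacian $L = D - A$), since the tutorial discusses several normalizations:

```python
# Minimal spectral clustering sketch on two toy blobs.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(4, 0.3, (15, 2))])

# Step 2: affinity matrix via a Gaussian kernel (sigma = 1).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-sq / 2.0)
np.fill_diagonal(A, 0)

# Step 3: unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(1)) - A

# Step 4: eigenvectors for the k smallest eigenvalues give the embedding.
vals, vecs = eigh(L)
embedding = vecs[:, :2]

# Final step: ordinary k-means on the spectral embedding.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```

Note that k-means still runs at the end, so local minima are not eliminated; the embedding just tends to make the clusters nearly linearly separable, where k-means rarely gets stuck.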
I have a connected weighted graph and I want to use k-means to cluster the nodes based on their distances (smaller distances indicate that the nodes are more likely to be in the same cluster). I computed the Laplacian of the graph and chose the eigenvectors with nonzero eigenvalues. I then performed k-means in the embedding space represented by the chosen eigenvectors. In the following example, I want to create 2 clusters, and I chose nodes $1$ …
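For comparison, here is a small sketch of that procedure on a made-up weighted graph (not the questioner's example): two heavy triangles joined by one weak edge. For a connected graph, only the constant eigenvector has eigenvalue zero, so "nonzero eigenvalues" means skipping the first eigenvector:

```python
# Spectral embedding of an explicit weighted graph, then k-means.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

W = np.zeros((6, 6))
edges = [(0, 1, 5), (0, 2, 5), (1, 2, 5),   # triangle A, heavy weights
         (3, 4, 5), (3, 5, 5), (4, 5, 5),   # triangle B, heavy weights
         (2, 3, 0.1)]                        # weak bridge between them
for i, j, w in edges:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(1)) - W                    # unnormalized graph Laplacian
vals, vecs = eigh(L)
# Skip the constant eigenvector (eigenvalue 0); for 2 clusters the second
# eigenvector (the Fiedler vector) is enough.
embedding = vecs[:, 1:2]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```

For $k$ clusters one would usually keep the eigenvectors of the $k$ smallest eigenvalues (dropping only the trivial one), rather than all nonzero eigenvalues.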
In an academic paper, they talk about using a nearest-neighbour algorithm to predict the cluster of a new point, and about how the number of nearest neighbours is set to 10 in their example. What do they mean by this? The two things I could think of were: look at which 10 points from the training set (neighbours) are closest and assign the new point to the cluster that the majority of those points belong to; or collect one by one …
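The first interpretation (majority vote over the 10 nearest training points) can be sketched as follows; the function name and data here are my own, not the paper's:

```python
# Hypothetical sketch of k-NN cluster assignment by majority vote.
import numpy as np
from collections import Counter

def knn_cluster(x_new, X_train, cluster_labels, k=10):
    """Assign x_new to the cluster holding the majority of its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(cluster_labels[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(1)
# Toy training set: two clusters around (0, 0) and (5, 5).
X_train = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
cluster_labels = np.array([0] * 20 + [1] * 20)
print(knn_cluster(np.array([5.1, 4.9]), X_train, cluster_labels))
```

This is exactly a k-NN classifier whose training "classes" are the cluster labels, which is the most common reading of such a statement.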
After reading a few papers about subspace clustering (e.g. the one by Elhamifar and Vidal), it looks like subspace clustering subsumes the scenario handled by spectral clustering: it works for data distributed in a union of subspaces, while spectral clustering works for just one subspace; also, some subspace clustering algorithms (e.g. the linked reference) use spectral clustering as one step. So I wonder: does that mean subspace clustering always performs better than spectral clustering? If those subspaces …
I have a galaxy spectrum data set (22000 spectra in total). Similar to electronic wave data, it is two-dimensional (flux vs. wavelength). A typical wavelength plot looks like the one below. Now I am running k-means on this data set to cluster the spectra based on their shape/pattern only (using scikit-learn). Some of the k-means results are baffling me; I have made a flow chart of how the candidates clustered as I kept increasing the number of …
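One detail that often causes baffling k-means results on spectra: without normalization, overall flux scale dominates Euclidean distance, so bright and faint spectra of the same shape land in different clusters. A sketch of clustering by shape only, on toy spectra of my own (not the galaxy data):

```python
# Hedged sketch: unit-normalize each spectrum before k-means so that
# only the shape, not the overall brightness, drives the clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

wavelength = np.linspace(0, 1, 50)
rng = np.random.default_rng(0)
# Two shape families (sine-like, cosine-like), each at three brightnesses.
spectra = np.vstack(
    [a * np.sin(2 * np.pi * wavelength) + rng.normal(0, 0.01, 50) for a in (1, 5, 10)]
    + [a * np.cos(2 * np.pi * wavelength) + rng.normal(0, 0.01, 50) for a in (1, 5, 10)]
)
X = normalize(spectra)   # unit L2 norm per spectrum: shape, not scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

After normalization the three brightness levels of each shape collapse to nearly the same unit vector, so k-means recovers the two shape families.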
I have a pandas table which contains data about different observations, each one measured at different wavelengths. These observations differ from each other in the treatment they received. The table looks something like this:

>>> name  treatment  410.1  423.2  445.6  477.1  485.2  ....
0     A1          0   0.01   0.02   0.04   0.05   0.87
1     A2          1   0.04   0.05   0.05   0.06   0.04
2     A3          2   0.03   0.02   0.03   0.01   0.03
3     A4          0   0.02   0.02   0.04   0.05   0.91
4     A5          1   0.05   …
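If the eventual goal is to cluster such observations on their spectra, a minimal sketch might look like the following; the frame below is a small illustrative stand-in with the same shape (the real table has more wavelength columns and rows), and dropping the metadata columns before clustering is my assumption about the intent:

```python
# Hedged sketch: k-means on the wavelength columns of a pandas frame,
# keeping name/treatment out of the feature matrix.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "name": ["A1", "A2", "A3", "A4"],
    "treatment": [0, 1, 2, 0],
    "410.1": [0.01, 0.04, 0.03, 0.02],
    "423.2": [0.02, 0.05, 0.02, 0.02],
    "485.2": [0.87, 0.04, 0.03, 0.91],
})

# Cluster only on the measurements, not on the metadata columns.
X = df.drop(columns=["name", "treatment"])
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(df[["name", "treatment", "cluster"]])
```

One can then cross-tabulate `cluster` against `treatment` (e.g. with `pd.crosstab`) to see whether the clusters line up with the treatments.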
I've lately been wondering about kernel k-means and spectral clustering algorithms and their differences. I know that spectral clustering is a broader term and that different settings can affect the way it works, but one popular variant runs k-means clustering on the spectral embedding of the affinity matrix. Kernel k-means, on the other hand, applies k-means clustering directly to the affinity matrix. Therefore one immediate theoretical difference is that it omits the spectral embedding step, i.e. it doesn't look for the lower-dimensional …
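To make the contrast concrete, here is a minimal kernel k-means sketch of my own (scikit-learn has no built-in kernel k-means), run directly on the same Gaussian affinity matrix a spectral pipeline would start from; no eigendecomposition or embedding is ever formed:

```python
# Kernel k-means via Lloyd's iterations expressed purely in kernel entries:
# ||phi(x_i) - mu_c||^2 = K_ii - 2 mean_j K_ij + mean_{j,l} K_jl over cluster c.
import numpy as np

def kernel_kmeans(K, k, n_iter=20, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, n)          # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if len(idx) == 0:
                dist[:, c] = np.inf         # empty cluster: never chosen
                continue
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(1)
                          + K[np.ix_(idx, idx)].mean())
        labels = dist.argmin(1)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(4, 0.3, (15, 2))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                       # Gaussian affinity/kernel matrix
labels = kernel_kmeans(K, 2)
print(labels)
```

Dhillon, Guan and Kulis showed that weighted kernel k-means and (normalized-cut) spectral clustering optimize closely related objectives, so the omitted embedding is largely a question of relaxation and optimization strategy rather than of objective.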
Which version of spectral clustering is implemented in the sklearn library? Is it Shi & Malik or Ng, Jordan & Weiss from this tutorial? In the sklearn user guide, both versions are mentioned in the references. From the source code, it is not trivial to understand what is implemented, as the authors used some tricks for code optimization.
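Whichever normalization the internals use, the public API is the same; for reference, a minimal run on toy data (my own example, not from the sklearn docs) looks like this, and the `assign_labels` parameter at least lets you switch the final discretization step:

```python
# Minimal SpectralClustering run; affinity="rbf" builds the Gaussian
# affinity matrix internally from the raw points.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(4, 0.3, (15, 2))])

model = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1.0,
                           assign_labels="kmeans", random_state=0)
labels = model.fit_predict(X)
print(labels)
```

To pin down the exact variant for your sklearn version, the `spectral_embedding` helper in `sklearn.manifold` (which `SpectralClustering` calls) is the place to read, since the normalization of the Laplacian happens there.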