clustering 2-dimensional Euclidean vectors - appropriate dissimilarity measure

I've got a set of approx. 50,000 2-dimensional Euclidean vectors which belong to 20 groups, i.e. each group has approx. 2,500 2-dimensional Euclidean vectors. My data consists of endpoint coordinates, i.e. $x_0, y_0, x_1, y_1$. Now I would like to cluster the vectors within these groups, probably using k-means/k-medoids clustering (or another clustering algorithm with a pre-defined number of clusters). What is also important: my main focus is on a vector's direction; length is a minor concern (but at best it should still be taken into consideration). What I'm struggling with is the choice of a dissimilarity measure suited to my problem. So here are my questions:

  1. Does it matter how the data is specified? Alternatively, I could calculate the angle and length of each vector and specify the data as $x_0, y_0, angle, length$. My intuition is that if the angle is explicitly present, the Euclidean distance should do a better job of capturing the vector's direction. What is more, I could perhaps use some weighting, modify the Euclidean distance, and calculate the distance between two observations as, for example (see the sketch after this question list):

$\sqrt{(x^1_0 - x^2_0)^2 + (y^1_0 - y^2_0)^2 + (angle^1-angle^2)^2 + \frac{1}{n}(length^1-length^2)^2}$

where $n$ is some constant.

  2. I also considered angular distance as a dissimilarity measure. From what I know, this is equivalent to clustering the standardised data points and therefore doesn't capture size (length, in my case). But I'm not sure whether k-means clustering can be done with cosine distance. If so, is there any package in R that allows that?

  3. Is it a good and statistically valid idea to perform clustering twice: first, cluster the starting points, and second, within those clusters, perform clustering on angles and lengths?

  4. Do you guys know of any papers regarding a similar problem, i.e. clustering 2-dimensional data points? Any example would be very handy.
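For question 1, here is a minimal Python sketch of that weighted dissimilarity; the row layout $(x_0, y_0, angle, length)$, the toy data and $n = 4$ are assumptions made only for illustration. The resulting precomputed matrix can be handed to a k-medoids/PAM implementation (e.g. `pam()` from the `cluster` package in R accepts a dissimilarity matrix).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def weighted_dissimilarity(u, v, n=4.0):
    """Dissimilarity from question 1; each observation is (x0, y0, angle, length).
    The length term is down-weighted by 1/n. If the angles span the full circle,
    the angle difference may need wrapping to [-pi, pi] before squaring."""
    return np.sqrt((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
                   + (u[2] - v[2]) ** 2 + (u[3] - v[3]) ** 2 / n)

# Toy data: rows are observations (x0, y0, angle, length)
X = np.array([[0.0, 0.0, 0.1, 1.0],
              [0.1, 0.0, 0.2, 1.1],
              [5.0, 5.0, 3.0, 0.5]])

# Pairwise dissimilarity matrix, usable with any method that accepts
# precomputed distances (k-medoids/PAM, hierarchical clustering, ...).
D = squareform(pdist(X, metric=lambda u, v: weighted_dissimilarity(u, v, n=4.0)))
print(D)
```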

Topic cosine-distance distance similarity k-means clustering

Category Data Science


For this kind of situation, spectral clustering is an intuitive solution. Basically, the idea is to perform k-means clustering in a transformed feature space, obtained by defining what the inner product (i.e. the similarity) between points should be.

The main point is to give yourself a similarity measure. In your case, this could be:

$$S(v_1, v_2) = \exp\left(-\frac{(x_0^{(1)}-x_0^{(2)})^2+(y_0^{(1)}-y_0^{(2)})^2}{2\sigma_{start}^2} - \frac{(l^{(1)} - l^{(2)})^2}{2\sigma_{l}^2} - \frac{(\theta^{(1)} - \theta^{(2)})^2}{2\sigma_{\theta}^2}\right)$$

where:

  • the $^{(1)}$ and $^{(2)}$ superscripts refer to vectors 1 and 2
  • $x_0$ and $y_0$ are the vector starting point coordinates
  • $l$ denotes the vector length (Euclidean norm)
  • $\theta$ denotes the angle
  • $\sigma_{start}$, $\sigma_{l}$ and $\sigma_{\theta}$ are custom parameters that you should adjust to give more or less importance to each aspect of your vector (a low $\sigma_{\theta}$ value, for instance, means the angle feature will be treated with high sensitivity)
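As a concrete illustration, a small Python sketch of this similarity could look like the following; the argument layout $(x_0, y_0, l, \theta)$ and the default $\sigma$ values are assumptions, not part of the answer itself.

```python
import numpy as np

def similarity(v1, v2, sigma_start=1.0, sigma_l=1.0, sigma_theta=0.5):
    """Gaussian-style similarity S(v1, v2) as defined above.
    Each vector is (x0, y0, l, theta); the sigmas control how much weight
    the starting point, the length and the angle each receive."""
    d_start = (v1[0] - v2[0]) ** 2 + (v1[1] - v2[1]) ** 2
    d_len = (v1[2] - v2[2]) ** 2
    d_theta = (v1[3] - v2[3]) ** 2  # wrap the angle difference first if needed
    return np.exp(-d_start / (2 * sigma_start ** 2)
                  - d_len / (2 * sigma_l ** 2)
                  - d_theta / (2 * sigma_theta ** 2))
```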

Then you should build the graph Laplacian matrix, take the eigenvectors associated with the smallest eigenvalues, and project your data onto these eigenvectors. You end up in a higher-dimensional space, but there your data will be easily separable by the k-means algorithm.
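A minimal sketch of those steps (unnormalised Laplacian, eigenvectors of the smallest eigenvalues, then k-means), assuming `S` is a precomputed similarity matrix built, for instance, from the function above; `sklearn.cluster.SpectralClustering` with `affinity="precomputed"` packages essentially the same pipeline (using a normalised Laplacian).

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(S, n_clusters=5, random_state=0):
    """Unnormalised spectral clustering on a precomputed similarity matrix S.
    Builds the graph Laplacian L = D - S, projects the data onto the
    eigenvectors of the smallest eigenvalues, and runs k-means there."""
    D = np.diag(S.sum(axis=1))       # degree matrix
    L = D - S                        # unnormalised graph Laplacian
    eigvals, eigvecs = eigh(L)       # eigh returns eigenvalues in ascending order
    U = eigvecs[:, :n_clusters]      # embedding: one row per original vector
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=random_state).fit_predict(U)
```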

The key is to adjust the $\sigma$ values well to get the clustering that you need.

Please note that this may be computationally intensive if your data contains too many points. You might want to use a smaller subset to find the right projection and cluster centers.


$(x_0,y_0,x_1,y_1)$ will be better behaved than, e.g., $(x_0,y_0,angle,length)$ (because the angle has a very different scale). Alternatively, you could also use $(x_0,y_0,x_1-x_0,y_1-y_0)$.

As you want to consider both points, you of course should use both.

But if all vectors are very short, you may still need to carefully scale attributes for best results, or the clustering may be mostly based on the starting point.

Euclidean distance on these vectors should be fine. The x and y values all live in the same Euclidean space, and presumably a difference of 1 in x means the same as a difference of 1 in y, doesn't it? That no longer holds once you use the angle! One could argue that the two points should be treated independently, so it's reasonable to try the sum of the two distances in $(x_0,y_0)$ and $(x_1,y_1)$, but I don't expect the results to be very different. For k-means, though, I'd stick to the usual sum of squares (it doesn't minimise Euclidean distances, but squared errors). If your data is too noisy, consider more robust algorithms such as DBSCAN.
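A short Python sketch of this recommendation, assuming the raw data are rows of $(x_0, y_0, x_1, y_1)$; the placeholder data, the re-weighting factor `w` and the number of clusters are illustrative choices, not prescriptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: rows are (x0, y0, x1, y1)
coords = np.random.default_rng(0).normal(size=(2500, 4))
x0, y0, x1, y1 = coords.T

# Starting point plus displacement, as suggested above
features = np.column_stack([x0, y0, x1 - x0, y1 - y0])

# Optional re-weighting: w > 1 emphasises direction/length over the start point,
# useful when the vectors are short and the start point would otherwise dominate.
w = 3.0
features[:, 2:] *= w

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
```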


I hope I followed your question correctly. If you have those data points as 2D vectors, that would mean you have, say, $N=50,000$ data points, each represented as $\left(x_i,y_i\right)$, right? If angle and length are what you are concerned about, then your formulation seems right. You can convert each data point $\left(x_i,y_i\right)$ into $\left(\theta_i, d_i\right)$, the angle and length of the vector. These two parameters $\left(\theta_i, d_i\right)$ act as two features for each of your samples, and you can run k-means on them. You have to be careful to use a consistent convention for your angle (always anti-clockwise or always clockwise). Your dissimilarity measure seems quite correct as far as I can tell. I'm sure you are aware of this Python package, but just for the sake of completeness you can use this, or in MATLAB you can use this. I'm not sure about R, so pardon me if that was the essential part of your question.
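A minimal sketch of that conversion, assuming the raw rows are actually endpoint quadruples $(x_0, y_0, x_1, y_1)$ rather than single points; `np.arctan2` gives a consistent (counter-clockwise) angle in $[-\pi, \pi]$. Note that plain k-means treats the angle as an ordinary Euclidean coordinate, so the wrap-around at $\pm\pi$ is ignored.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: rows are (x0, y0, x1, y1)
coords = np.random.default_rng(1).normal(size=(2500, 4))
dx = coords[:, 2] - coords[:, 0]
dy = coords[:, 3] - coords[:, 1]

theta = np.arctan2(dy, dx)   # angle, consistent counter-clockwise convention
length = np.hypot(dx, dy)    # Euclidean norm (vector length)

features = np.column_stack([theta, length])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
```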
