Clustering 2-dimensional Euclidean vectors - appropriate dissimilarity measure
I have a set of approx. 50,000 2-dimensional Euclidean vectors that belong to 20 groups, i.e. each group has approx. 2,500 vectors. My data consists of endpoint coordinates, i.e. $x_0, y_0, x_1, y_1$. I would like to cluster the vectors within each of these groups, probably using k-means/k-medoids clustering (or another clustering algorithm with a pre-defined number of clusters). My main focus is on the vectors' direction; length is a minor concern (but ideally it should still be taken into consideration). What I'm struggling with is the choice of a dissimilarity measure suited to my problem. So here are my questions:
- Does it matter how the data is specified? Alternatively, I could calculate the angle and length of each vector and specify the data as $x_0, y_0, angle, length$. My intuition is that if the angle is explicitly present, the Euclidean distance should do a better job of capturing the vector's direction. What is more, I could use some weighting, i.e. modify the Euclidean distance and calculate the distance between two observations as, for example:
$\sqrt{(x^1_0 - x^2_0)^2 + (y^1_0 - y^2_0)^2 + (angle^1-angle^2)^2 + \frac{1}{n}(length^1-length^2)^2}$
where $n$ is some constant.
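To make the weighted distance concrete, here is a minimal sketch in plain Python (chosen only for illustration). The function names `to_polar` and `weighted_distance` and the value `n=4.0` are hypothetical, and the angle is assumed to be measured in radians with `math.atan2`.

```python
import math

def to_polar(x0, y0, x1, y1):
    """Convert endpoint coordinates to (x0, y0, angle, length)."""
    dx, dy = x1 - x0, y1 - y0
    return x0, y0, math.atan2(dy, dx), math.hypot(dx, dy)

def weighted_distance(a, b, n=10.0):
    """Weighted distance between two observations given as (x0, y0, angle, length).

    The squared length difference is divided by n, so a larger n
    makes length matter less relative to position and angle.
    """
    return math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + (a[2] - b[2]) ** 2
        + (a[3] - b[3]) ** 2 / n
    )

v1 = to_polar(0.0, 0.0, 1.0, 0.0)  # points right, length 1
v2 = to_polar(0.0, 0.0, 0.0, 2.0)  # points up, length 2
d = weighted_distance(v1, v2, n=4.0)
```

Note that this follows the formula literally: the raw difference of two `atan2` angles does not wrap around, so two vectors pointing almost exactly left (angles close to $\pi$ and $-\pi$) would come out as far apart.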
I also considered angular distance as a dissimilarity measure. From what I know, this is equivalent to clustering the standardised (unit-length) data points and therefore doesn't capture size (length in my case). But I'm not sure whether k-means clustering can be done with cosine distance. If so, is there any package in R that allows that?
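The equivalence mentioned above can be checked numerically: for vectors scaled to unit length, the squared Euclidean distance equals twice the cosine distance. The sketch below is plain Python for illustration only; the helper names are hypothetical, and vectors are assumed to have non-zero length.

```python
import math

def direction(x0, y0, x1, y1):
    """Unit-length direction of the vector from (x0, y0) to (x1, y1)."""
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)  # assumed non-zero
    return dx / length, dy / length

def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    dot = u[0] * v[0] + u[1] * v[1]
    return 1.0 - dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def squared_euclidean(u, v):
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2

# Two vectors given by their endpoints; displacements are (3, 4) and (0, 5).
a = (0.0, 0.0, 3.0, 4.0)
b = (1.0, 1.0, 1.0, 6.0)

ua, ub = direction(*a), direction(*b)
lhs = squared_euclidean(ua, ub)                      # on unit vectors
rhs = 2.0 * cosine_distance((3.0, 4.0), (0.0, 5.0))  # on raw displacements
```

Both `lhs` and `rhs` agree up to floating-point error, which is why running ordinary k-means on the unit-length directions behaves like clustering by cosine distance, at the price of discarding length.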
Is it a good and statistically valid idea to perform clustering twice: first to cluster the starting points, and then, within those clusters, to cluster on angles and lengths?
Do you guys know of any papers regarding a similar problem, i.e. clustering 2-dimensional data points? Any example would be very handy.
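For clarity, this is roughly what the two-stage procedure would look like. It is only a sketch in plain Python under several assumptions: `kmeans` is a bare-bones Lloyd's algorithm with an arbitrary deterministic initialisation (not a library routine), `two_stage`, `k_start` and `k_shape` are hypothetical names, and the second stage uses unweighted `(angle, length)` features.

```python
import math

def kmeans(points, k, iters=50):
    """Bare-bones Lloyd's algorithm; returns one cluster label per point.

    Initial centres are k points taken at even spacing from the input,
    an arbitrary choice made only to keep this sketch deterministic.
    """
    step = max(len(points) // k, 1)
    centres = [points[min(i * step, len(points) - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Update step: mean of the members; empty clusters keep their centre.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centres.append(centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels

def two_stage(vectors, k_start, k_shape):
    """vectors: list of (x0, y0, x1, y1).

    Returns a (start_label, shape_label) pair for every vector.
    """
    # Stage 1: cluster the starting points only.
    start_labels = kmeans([(v[0], v[1]) for v in vectors], k_start)
    result = [None] * len(vectors)
    for s in range(k_start):
        idx = [i for i, lab in enumerate(start_labels) if lab == s]
        if not idx:
            continue
        # Stage 2: within one start cluster, cluster on (angle, length).
        shapes = []
        for i in idx:
            dx = vectors[i][2] - vectors[i][0]
            dy = vectors[i][3] - vectors[i][1]
            shapes.append((math.atan2(dy, dx), math.hypot(dx, dy)))
        shape_labels = kmeans(shapes, min(k_shape, len(idx)))
        for i, lab in zip(idx, shape_labels):
            result[i] = (s, lab)
    return result

vectors = [
    (0.0, 0.0, 1.0, 0.0), (0.1, 0.0, 1.1, 0.0),          # near origin, pointing right
    (0.0, 0.1, 0.0, 1.1), (0.1, 0.1, 0.1, 1.1),          # near origin, pointing up
    (10.0, 10.0, 11.0, 10.0), (10.1, 10.0, 11.1, 10.0),  # far away, pointing right
    (10.0, 10.1, 10.0, 11.1), (10.1, 10.1, 10.1, 11.1),  # far away, pointing up
]
labels = two_stage(vectors, k_start=2, k_shape=2)
```

On this toy data the first label separates the two start locations and the second label separates "right" from "up" within each location. Whether such a hierarchy is statistically sensible on the real data is exactly what I am asking.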
Topic cosine-distance distance similarity k-means clustering
Category Data Science