Sum of distances from N points to a set of M other points in R

Imagine two related problems:

  1. I have one 2-dim data point and a set of $M$ other 2-dim data points. How do I calculate the sum of all distances between that one point and those $M$ points? The result is one number.

  2. Now I have $N$ 2-dim points and the same set of $M$ 2-dim data points as above. How do I calculate, for each of the $N$ points, the sum of its distances to those $M$ points? It should be equivalent to looping through the $N$ points and computing the sum for each. The result is $N$ numbers.

This problem relates to clustering. I extracted clusters from calibration data with kmeans, and now I want to identify which cluster my new point(s) belong to. Of course, simple looping is inefficient.

UPDATE:

This is an R question.

Mathematical formulation:

$(x_i, y_i)$, $i = 1, \dots, N$, is the $N$-set; $(X_k, Y_k)$, $k = 1, \dots, M$, is the $M$-set.

$$d^2_i = \sum_{k=1}^M \left[ (x_i - X_k)^2 + (y_i - Y_k)^2 \right]$$
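
The formula above sums squared distances, so it can be evaluated without a loop. Here is a minimal sketch in R, assuming made-up data in hypothetical vectors `x`, `y` (the $N$-set) and `X`, `Y` (the $M$-set). It shows two variants: a direct one that builds the $N \times M$ matrix with `outer()`, and one that expands the square, $\sum_k (x_i - X_k)^2 = M x_i^2 - 2 x_i \sum_k X_k + \sum_k X_k^2$, so only a few sums over the $M$-set are needed.

```r
# Made-up example data: N = 5 query points, M = 20 reference points
set.seed(1)
x <- runif(5);  y <- runif(5)    # N-set (hypothetical names)
X <- runif(20); Y <- runif(20)   # M-set (hypothetical names)

# Direct form: N x M matrices of coordinate differences, squared and summed per row
d2_direct <- rowSums(outer(x, X, "-")^2 + outer(y, Y, "-")^2)

# Expanded form: uses only sums over the M-set, no N x M matrix is built
M <- length(X)
d2_fast <- M * (x^2 + y^2) - 2 * (x * sum(X) + y * sum(Y)) + sum(X^2 + Y^2)

# Both variants should agree up to floating-point error
all.equal(d2_direct, d2_fast)
```

The expanded form only works for squared distances. If plain Euclidean distances are needed, the square root has to be taken per pair, so the `outer()` variant is the one to adapt.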

UPDATE2:

One of the methods I discovered is to split everything into two steps:

  1. (calibration) identify clusters with some method like stats::kmeans(). It provides classes for the entire dataset.
  2. (backtesting) split the dataset into train and test subsamples and use class::knn(). Cluster IDs from the train subsample are treated as ground truth. The output is the cluster IDs that knn assigns within the test subsample. These are chosen using the Euclidean metric, exactly as I need.

Although I do not have control over the process, it still delivers satisfactory results and speed.
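
The two steps can be sketched roughly as follows. This is only an illustration under assumptions: the data are two synthetic, well-separated blobs, and the names `pts`, `km`, `test_idx`, and the choices `centers = 2` and `k = 5` are hypothetical, not taken from my real data.

```r
library(class)  # provides knn()

set.seed(42)
# Synthetic calibration data: two well-separated 2-dim blobs (assumption)
pts <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 3, 0.3), rnorm(50, 3, 0.3))
)
colnames(pts) <- c("x", "y")

# Step 1 (calibration): cluster the entire dataset
km <- kmeans(pts, centers = 2, nstart = 10)

# Step 2 (backtesting): hold out a test subsample, treat the kmeans labels
# of the remaining points as ground truth, and let knn label the held-out points
test_idx <- sample(nrow(pts), 30)
pred <- knn(train = pts[-test_idx, ],
            test  = pts[test_idx, ],
            cl    = factor(km$cluster[-test_idx]),
            k     = 5)

# Share of held-out points where knn reproduces the kmeans cluster ID
agreement <- mean(as.integer(as.character(pred)) == km$cluster[test_idx])
agreement
```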

The real-time solution, where a single point has to be classified, can be implemented with the calibrated sample (after step 1).
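
For the single-point case, a sketch could look like the one below, again on synthetic data with hypothetical names (`pts`, `km`, `new_point`). The whole calibrated sample serves as the training set for `class::knn()`. The last line is a cross-check with a different technique, nearest centroid, which is not part of the knn approach itself.

```r
library(class)  # provides knn()

set.seed(42)
# Synthetic calibrated sample: two well-separated 2-dim blobs (assumption)
pts <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 3, 0.3), rnorm(50, 3, 0.3))
)
colnames(pts) <- c("x", "y")
km <- kmeans(pts, centers = 2, nstart = 10)   # step 1 (calibration)

# A single new point to classify (made-up coordinates near the second blob)
new_point <- matrix(c(2.9, 3.1), nrow = 1, dimnames = list(NULL, c("x", "y")))

# knn against the full calibrated sample, using the kmeans labels as classes
new_cluster <- knn(train = pts, test = new_point,
                   cl = factor(km$cluster), k = 5)
new_cluster

# Cross-check (different technique): ID of the nearest kmeans centroid
centroid_id <- which.min(colSums((t(km$centers) - as.vector(new_point))^2))
```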



First, let's create a function for the calculation:

dist_func <- function(a, b){
  sqrt((a$x-b$x)^2 + (a$y-b$y)^2)
}

Now we will create a dataset to handle case 1:

source <- data.frame(x=22.78, y= 73.27)

set.seed(4)
destination <- data.frame(x=runif(20, 22, 23), y=runif(20,77,78))

#Now just call the function and sum up

> sum(dist_func(source, destination))
[1] 86.76514

For case 2, we will use a for loop to store the results:

#Let's change the source

set.seed(4)
source <- data.frame(x=runif(5, 22, 23), y=runif(5,77,78))

#We will store the result in this vector

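dist_output <- numeric(nrow(source))  # preallocate one slot per source point

for (i in seq_len(nrow(source))) {
  # sum of distances from the i-th source point to all destination points
  dist_output[i] <- sum(dist_func(source[i, ], destination))
}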

> dist_output
[1]  9.689657 12.537179 10.379821 11.005016 13.006207
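
Since the question mentions that looping is inefficient, here is a loop-free variant as a rough sketch. It rebuilds the same data with the same seeds, but stores the source points under the hypothetical name `source_pts` (to avoid masking base R's `source()` function), builds the full $N \times M$ distance matrix with `outer()`, and takes row sums. The result is compared against the loop-style calculation.

```r
dist_func <- function(a, b) {
  sqrt((a$x - b$x)^2 + (a$y - b$y)^2)
}

# Same data as above, regenerated with the same seeds
set.seed(4)
destination <- data.frame(x = runif(20, 22, 23), y = runif(20, 77, 78))
set.seed(4)
source_pts <- data.frame(x = runif(5, 22, 23), y = runif(5, 77, 78))

# N x M matrix of Euclidean distances, built without an explicit loop
dist_mat <- sqrt(outer(source_pts$x, destination$x, "-")^2 +
                 outer(source_pts$y, destination$y, "-")^2)

# One sum per source point (N numbers)
dist_sums <- rowSums(dist_mat)

# Reference: the per-point calculation from the loop version
loop_sums <- vapply(seq_len(nrow(source_pts)),
                    function(i) sum(dist_func(source_pts[i, ], destination)),
                    numeric(1))
all.equal(dist_sums, loop_sums)
```

The matrix needs memory proportional to $N \times M$, so for very large inputs it may be better to process the source points in chunks.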

Let me know in the comments if that solves your problem.


Why not do (pseudo code):

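import numpy as np
from sklearn.metrics import euclidean_distances

X = np.random.rand(20, 2)  # placeholder data, X.shape = (M, 2)
x1, y1 = 0.5, 0.5          # placeholder query point

Y = euclidean_distances(X, [[x1, y1]])  # shape (M, 1): distance from each row of X to the point
total = Y.sum()                         # sum of the M distances (problem 1)

Here (x1, y1) is the 2-dim point from problem 1. Passing all N points as the second argument instead gives an (M, N) matrix, and its column sums are the N numbers asked for in problem 2.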
