How can I find a set of customers similar to a seed set using KNN?

I have a study where I want to find users similar to a given set of users (SEED).

My data is a pivot by customer. For example, a sample of SEED looks like this (note that I drop cust_id before fitting):

cust_id | spend_food | spend_nike | spend_harrods
1       | 145        | 45         | 32
2       | 85         | 89         | 0
4       | 23         | 67         | 1900
5       | 84         | 12         | 900

To find users in my 'test set' that are similar to the SEED set, I started thinking about using KNN or some other type of similarity measure. Is the code below along the right lines?

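from sklearn.neighbors import NearestNeighbors

# Fit the neighbour index on the seed set (cust_id already dropped).
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(data_seed)
# Find the k nearest seed neighbours of each point in data_seed
# (a point's first neighbour is normally itself, at distance 0).
distances, indices = nbrs.kneighbors(data_seed)

# Now query the test set: for each test row this returns the distances to,
# and the indices of, its 2 nearest seed points.
print(nbrs.kneighbors(test_set))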

The output is something like:

 (array([[  1901.51718533,   2304.29615202],
       [   786.55850526,    844.11741209],
       [ 32834.73804174,  35856.9870236 ],
       [  1240.22678184,   1368.8120787 ],
       [  5879.75134223,   6106.69479986],
       [  3796.49773432,   3910.9565544 ],
       [  2860.92799574,   3352.6408945 ],
       [  3313.40896602,   3569.3014983 ],
       [  3572.53834412,   3705.05568968],
       [  4527.76830212,   5181.05057739],

However, using these outputs, how can I rank the test samples so that the ones most similar to the seed set are chosen? Are the distances in effect the 'similarity score'?
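To make the question concrete, here is a minimal sketch of one possible reading: treat the distance to the nearest seed point as a dissimilarity score, so that a smaller distance means a more similar customer. The toy arrays are hypothetical stand-ins for my real data, with columns in the order spend_food, spend_nike, spend_harrods.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data; columns: spend_food, spend_nike, spend_harrods.
data_seed = np.array([[145, 45, 32],
                      [85, 89, 0],
                      [23, 67, 1900],
                      [84, 12, 900]], dtype=float)
test_set = np.array([[140, 50, 30],
                     [5000, 5000, 5000],
                     [30, 60, 1850]], dtype=float)

nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(data_seed)
distances, indices = nbrs.kneighbors(test_set)

# Each row of `distances` is sorted ascending, so column 0 holds the
# distance from that test row to its single nearest seed point.
nearest_dist = distances[:, 0]

# Smaller distance = more similar, so sort ascending to rank test rows.
ranking = np.argsort(nearest_dist)
print(ranking)  # test-row indices, most similar first
```

With this reading, the distance is a dissimilarity rather than a similarity, so the "best" test rows are the ones at the top of the ascending sort.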

Treating the distance as a similarity score, I thought another approach would be to set n_neighbors equal to the size of the seed set when querying each sample in the test set. That way, for each test sample I get its distance to every seed point. I can then average these distances, rank the test samples from smallest to largest average distance, and choose the top x%.

Is the above approach correct, or is there another way to use a similarity/recommender-type approach to find a similar group?

Although the above is done on a sample, my full dataset will be around 100k, so I am mindful of computational cost.
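The averaging idea could be sketched roughly as follows. The toy arrays and the `top_fraction` value are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data; columns: spend_food, spend_nike, spend_harrods.
data_seed = np.array([[145, 45, 32],
                      [85, 89, 0],
                      [23, 67, 1900],
                      [84, 12, 900]], dtype=float)
test_set = np.array([[140, 50, 30],
                     [5000, 5000, 5000],
                     [30, 60, 1850],
                     [90, 20, 880]], dtype=float)

# Ask for as many neighbours as there are seed points, so every test row
# gets its distance to every seed point.
nbrs_all = NearestNeighbors(n_neighbors=len(data_seed)).fit(data_seed)
dist_all, _ = nbrs_all.kneighbors(test_set)

# Average distance from each test row to the whole seed set.
mean_dist = dist_all.mean(axis=1)

# Keep the top x% of test rows with the smallest average distance.
top_fraction = 0.5  # hypothetical cut-off
n_keep = max(1, int(np.ceil(top_fraction * len(test_set))))
selected = np.argsort(mean_dist)[:n_keep]
print(selected)
```

Because every seed point is requested, no neighbour search is really being saved here; the same averages could be obtained from a plain pairwise distance matrix between the test set and the seed set.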
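On the cost side, one option I am considering is to compute the average distances in chunks, so that the full distance matrix is never held in memory at once. This is only a sketch, assuming the 100k refers to the number of rows to score; the random arrays, their sizes and the `mean_to_seed` helper are hypothetical. It uses scikit-learn's `pairwise_distances_chunked`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# Hypothetical stand-ins for the real seed and test sets.
rng = np.random.default_rng(0)
seed = rng.random((200, 3)) * 1000
test = rng.random((5000, 3)) * 1000

def mean_to_seed(D_chunk, start):
    # D_chunk holds distances from one block of test rows to all seed rows;
    # reduce it to one average distance per test row.
    return D_chunk.mean(axis=1)

# working_memory is in MiB; a small value forces several chunks here.
chunks = pairwise_distances_chunked(test, seed,
                                    reduce_func=mean_to_seed,
                                    working_memory=1)
mean_dist = np.concatenate(list(chunks))

# Rank test rows from most to least similar to the seed set.
ranking = np.argsort(mean_dist)
```

Only one block of the distance matrix exists at a time, and each block is reduced to a single number per test row before the next one is computed.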
