How can I use KNN to find a set of customers similar to another set?
I have a study where I want to find users similar to a given set of users (SEED).
My data is pivoted by customer. A sample of SEED looks like this (note that I drop cust_id):
cust_id | spend_food | spend_nike | spend_harrods |
1 | 145 | 45 | 32 |
2 | 85 | 89 | 0 |
4 | 23 | 67 | 1900 |
5 | 84 | 12 | 900 |
To find users in my 'test set' that are similar to SEED, I started thinking about using KNN or some other similarity measure. Is the code below along the right lines?
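from sklearn.neighbors import NearestNeighbors

# Fit the neighbour index on the seed customers (cust_id already dropped).
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(data_seed)
# Find the k nearest seed neighbours of each point in data_seed.
distances, indices = nbrs.kneighbors(data_seed)
# Now apply to the test set: returns (distances, indices) to the nearest seed points.
print(nbrs.kneighbors(test_set))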
The output is something like:
(array([[ 1901.51718533, 2304.29615202],
[ 786.55850526, 844.11741209],
[ 32834.73804174, 35856.9870236 ],
[ 1240.22678184, 1368.8120787 ],
[ 5879.75134223, 6106.69479986],
[ 3796.49773432, 3910.9565544 ],
[ 2860.92799574, 3352.6408945 ],
[ 3313.40896602, 3569.3014983 ],
[ 3572.53834412, 3705.05568968],
[ 4527.76830212, 5181.05057739],
However, using these outputs, how can I rank the test samples so that the ones most similar to the seed set are chosen? Are the distances in effect the 'similarity score'?
Treating the distance as a similarity score, I thought another approach would be to set n_neighbors equal to the size of the seed set for each sample in the test set. That way, for each test sample I get a distance to every seed point. I can then average these distances, rank the test samples from smallest to largest average distance, and choose the top x%.
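To make the averaging idea concrete, here is a minimal sketch using plain NumPy instead of NearestNeighbors. The function name rank_by_mean_distance, the top_fraction parameter, and the test_set values are made up for illustration; only the seed rows come from the sample above. It assumes Euclidean distance on the raw spend columns.

```python
import numpy as np

def rank_by_mean_distance(seed, candidates, top_fraction=0.1):
    """Rank candidate rows by their mean Euclidean distance to all seed rows.

    Hypothetical helper for illustration; returns the indices of the closest
    top_fraction of candidates plus the mean distance of every candidate.
    """
    seed = np.asarray(seed, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Pairwise differences, shape (n_candidates, n_seed, n_features).
    diffs = candidates[:, None, :] - seed[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # One score per candidate: the average distance to every seed point.
    mean_dist = dists.mean(axis=1)
    # Smallest average distance first.
    order = np.argsort(mean_dist, kind="stable")
    n_keep = max(1, int(np.ceil(top_fraction * len(candidates))))
    return order[:n_keep], mean_dist

# Seed rows taken from the sample table (cust_id dropped).
seed = np.array([[145, 45, 32], [85, 89, 0], [23, 67, 1900], [84, 12, 900]])
# Made-up test rows, purely illustrative.
test_set = np.array([[100, 50, 20], [5000, 4000, 9000], [60, 40, 1000], [90, 80, 10]])

top_idx, mean_dist = rank_by_mean_distance(seed, test_set, top_fraction=0.5)
print(top_idx)
print(mean_dist)
```

Here a smaller mean distance means a more similar customer, so the distance acts as an inverse similarity score.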
Is the above approach correct, or is there another way to use a similarity/recommender-type approach to find a similar group?
Although the above is done on a sample, my full dataset will be around 100k, so I am mindful of computational cost.
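On the cost side, one possible way to keep memory bounded when averaging distances to every seed point is to process the test set in chunks. This is only a sketch: the function name mean_seed_distance_chunked, the chunk_size value, and the random demo data are assumptions, and it again uses Euclidean distance. The total work is still proportional to (number of test rows) x (number of seed rows), but only one chunk_size x n_seed distance block is held in memory at a time.

```python
import numpy as np

def mean_seed_distance_chunked(seed, candidates, chunk_size=10_000):
    """Mean Euclidean distance from each candidate to all seed rows, in chunks.

    Hypothetical helper for illustration only.
    """
    seed = np.asarray(seed, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    out = np.empty(len(candidates))
    seed_sq = (seed ** 2).sum(axis=1)
    for start in range(0, len(candidates), chunk_size):
        chunk = candidates[start:start + chunk_size]
        # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        sq = (chunk ** 2).sum(axis=1)[:, None] + seed_sq[None, :] - 2.0 * chunk @ seed.T
        # Clip tiny negative values caused by floating-point rounding.
        np.maximum(sq, 0.0, out=sq)
        out[start:start + chunk_size] = np.sqrt(sq).mean(axis=1)
    return out

# Random demo data (not from the question), seeded for reproducibility.
rng = np.random.default_rng(0)
seed_demo = rng.uniform(0, 2000, size=(50, 3))
candidates_demo = rng.uniform(0, 2000, size=(1000, 3))

scores = mean_seed_distance_chunked(seed_demo, candidates_demo, chunk_size=256)
ranking = np.argsort(scores)  # most similar (smallest mean distance) first
print(ranking[:10])
```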
Topic k-nn cosine-distance similarity recommender-system machine-learning
Category Data Science