Key generation from feature vectors in high dimensions
I welcome any suggestions to solve the following hard problem:
I have a dataset of float feature vectors of size 512, where each feature vector is extracted from a face image. I want to generate a key from a feature vector (the key can be a number, a binary code, etc.) that is consistent for each person, without any comparison between feature vectors. The only input I have is the given feature vector. For example, if I see a photo of me, I want to generate a number X. Another photo of me should generate the same number X, without comparing the two feature vectors from the two photos.
Assumption 1: feature vectors of images of the same person are very close to each other (dot product is high), and feature vectors of images of different persons are far apart (dot product is low).
Assumption 2: I only need about 1K or 10K keys, and it is OK if there are some collisions. The key doesn't have to be unique, but it has to be consistent.
Assumption 3: a given feature vector can come from a face image of a person who is not in my dataset.
I tried a few things:
Solution 1: the easy solution would be to assign each person a random key and, given a new feature vector, compare it to all stored vectors and assign the key of the match. But I want to generate the key without comparisons (for multiple reasons - this constraint is important).
Solution 2: I normalized the feature vectors so that they sit on the unit sphere in 512 dimensions, then divided the sphere using N=10k seeds. Given a feature vector, I assign it the index of the nearest seed. The problem is that clustering algorithms break down in very high dimensions: all seeds are far from a new feature vector, so the nearest one becomes almost random and is therefore not consistent across feature vectors of the same person.
Solution 3: I tried to discretize the feature vectors to generate a binary code (for example, taking the sign of each feature), but it doesn't produce a consistent enough code for feature vectors of the same person.
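To make solutions 2 and 3 concrete, here is a minimal sketch of the two ideas using NumPy. It is only an illustration under assumptions: the random seeds, the synthetic "photos" (a shared direction plus Gaussian noise), the noise level, and the function names are all hypothetical stand-ins, not my real embeddings or pipeline.

```python
import numpy as np

DIM = 512          # feature size from the question
N_SEEDS = 10_000   # number of keys (Assumption 2)
NOISE = 0.03       # hypothetical per-dimension noise between photos


def normalize(v):
    """Project vectors onto the unit sphere (L2 normalization)."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def nearest_seed_key(v, seeds):
    """Solution 2: key = index of the seed with the highest dot product."""
    return int(np.argmax(seeds @ normalize(v)))


def sign_code(v):
    """Solution 3: one bit per dimension, taken from the sign of each feature."""
    return (np.asarray(v) > 0).astype(np.uint8)


rng = np.random.default_rng(0)

# Random seeds on the sphere (assumption: seeds are not fitted to any data).
seeds = normalize(rng.standard_normal((N_SEEDS, DIM)))

# Synthetic stand-in for two photos of the same person.
person = normalize(rng.standard_normal(DIM))
photo_a = normalize(person + NOISE * rng.standard_normal(DIM))
photo_b = normalize(person + NOISE * rng.standard_normal(DIM))

cosine = float(photo_a @ photo_b)
key_a = nearest_seed_key(photo_a, seeds)
key_b = nearest_seed_key(photo_b, seeds)
bit_flips = int(np.sum(sign_code(photo_a) != sign_code(photo_b)))

print(f"cosine similarity between photos: {cosine:.3f}")
print(f"nearest-seed keys: {key_a} vs {key_b}")
print(f"sign-code bits that differ: {bit_flips} of {DIM}")
```

Comparing key_a with key_b, and counting bit_flips, is how I check consistency offline. At key-generation time each function only ever sees a single feature vector, which is the constraint I need to keep.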
I appreciate anything you can give.
Topic embeddings deep-learning clustering
Category Data Science