Sneakers representation learning
I am trying to build a model that takes an image of a shoe as input and outputs a meaningful N-dimensional embedding, so that shoes can be searched, compared, clustered, and used in a recommender system.
My first guess was to employ a Siamese CNN (DenseNet plus one extra fully connected layer producing a 32-dimensional embedding) trained with triplet loss and online hard mining. The idea was to train the network to predict whether the shoes in two images belong to the same model, based on the Euclidean distance between their embeddings.
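For context, roughly what I have in mind looks like the sketch below (DenseNet121, batch-hard mining, and the 0.2 margin are my placeholders, not necessarily the exact configuration I used):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EmbeddingNet(nn.Module):
    """DenseNet backbone with an extra fully connected layer
    projecting to a 32-dimensional embedding."""
    def __init__(self, embedding_dim=32):
        super().__init__()
        backbone = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        num_features = backbone.classifier.in_features  # 1024 for densenet121
        backbone.classifier = nn.Identity()             # drop the ImageNet head
        self.backbone = backbone
        self.fc = nn.Linear(num_features, embedding_dim)

    def forward(self, x):
        emb = self.fc(self.backbone(x))
        # L2-normalize so Euclidean distances are bounded and comparable
        return nn.functional.normalize(emb, p=2, dim=1)


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Online hard mining within a batch: for each anchor, take the
    farthest positive and the closest negative."""
    dist = torch.cdist(embeddings, embeddings, p=2)        # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)      # same-model mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: largest distance among same-model pairs (excluding self)
    hardest_pos = (dist * (same & ~eye).float()).max(dim=1).values
    # Hardest negative: smallest distance among different-model pairs
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values

    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```

This assumes batches are sampled so that every sneaker model in the batch appears at least twice (e.g., P models × K images per batch); otherwise the hardest-positive term degenerates to zero for anchors without a positive.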
However, it failed to generalize and showed poor results on the test data. My current dataset is ~4k images of ~500 different sneaker models, i.e. roughly 8 images per model.
What are my options in this situation?
Topic siamese-networks cnn image-recognition deep-learning feature-extraction
Category Data Science