Sneakers representation learning

I am trying to build a model that takes an image of a shoe as input and outputs a meaningful N-dimensional embedding, so that shoes can be searched, compared, clustered, and used in a recommender system.

My first guess was to employ a siamese CNN (DenseNet + one extra fully connected layer to generate the 32-dimensional embedding) with an online hard-mining triplet loss. The idea was to train the network to predict whether the shoes in two images belong to the same shoe model, based on the Euclidean distance between the outputs.
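For reference, here is a minimal sketch of that setup. It assumes Keras with a DenseNet121 backbone and uses a batch-hard variant of online triplet mining (Hermans et al., 2017) as the loss; the exact mining scheme in my experiments may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers


def batch_hard_triplet_loss(labels, embeddings, margin=0.2):
    """Batch-hard triplet loss: for every anchor in the batch, mine the
    hardest positive (farthest same-model image) and hardest negative
    (closest different-model image)."""
    # Pairwise squared Euclidean distances, shape (B, B).
    dot = tf.matmul(embeddings, embeddings, transpose_b=True)
    sq_norms = tf.linalg.diag_part(dot)
    dists = tf.maximum(sq_norms[:, None] - 2.0 * dot + sq_norms[None, :], 0.0)

    labels = tf.reshape(labels, (-1, 1))
    same = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32)

    # Hardest positive (self-distance is 0, so including it is harmless).
    hardest_pos = tf.reduce_max(dists * same, axis=1)
    # Hardest negative: mask out positives by pushing them past the largest
    # distance in the batch before taking the minimum.
    hardest_neg = tf.reduce_min(dists + tf.reduce_max(dists) * same, axis=1)

    return tf.reduce_mean(tf.maximum(hardest_pos - hardest_neg + margin, 0.0))


# Embedding network: DenseNet backbone + one dense layer, L2-normalized
# so that Euclidean distances between embeddings are bounded.
base = tf.keras.applications.DenseNet121(
    include_top=False, pooling="avg", weights="imagenet")
embedder = tf.keras.Sequential([
    base,
    layers.Dense(32),
    layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),
])
```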

However, it failed to generalize and showed poor results on the test data. My current dataset is ~4k images of ~500 different sneaker models.

What are my options in this situation?

Tags: siamese-networks, cnn, image-recognition, deep-learning, feature-extraction



The best option is to start from a pre-trained encoder. Please check the image collections on TF Hub (https://tfhub.dev/google/collections/image/1).
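With a dataset this small, a reasonable sketch is to freeze the pre-trained encoder and train only the embedding head on top of it. Assumptions here: TensorFlow with tensorflow_hub, and the MobileNet V2 feature-vector module as just one example handle from the linked collection:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Frozen pre-trained feature extractor; only the 32-d head below is trained,
# which greatly reduces the number of parameters fit on ~4k images.
encoder = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
    trainable=False,
)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
    encoder,
    tf.keras.layers.Dense(32),
    tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),
])
```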

I have worked on encoding shoes for sports fashion companies myself. Remember that there are many different ways images might be seen as similar, e.g. shape, color(!), etc. You might need to make a pre-processing choice first to drop some of these aspects, such as working with grayscale images or blurring patterns on the shoes.
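As a sketch of that kind of pre-processing (the average-pooling blur is a cheap stand-in for a proper Gaussian blur, which you might take from tensorflow_addons instead):

```python
import tensorflow as tf


def drop_style_cues(image):
    """Remove color and soften prints so the embedding focuses on shape."""
    image = tf.cast(image, tf.float32)
    gray = tf.image.rgb_to_grayscale(image)   # drop the colorway
    gray = tf.image.grayscale_to_rgb(gray)    # back to 3 channels for the encoder
    # Light blur to suppress fine patterns; avg-pooling approximates a blur.
    blurred = tf.nn.avg_pool2d(gray[tf.newaxis, ...],
                               ksize=5, strides=1, padding="SAME")
    return blurred[0]
```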
