Image Feature Vectors

I have downloaded a dataset of Amazon product data from http://jmcauley.ucsd.edu/data/amazon/. The dataset contains feature vectors of product images, around 1.5 million of them.

Each record in the dataset consists of 10 characters (the product ID) followed by 4096 floats, and this layout is repeated for every product.
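For reference, the record layout described above can be read with a short loop. This is only a minimal sketch: it assumes the floats are stored as 4-byte values in the machine's native byte order, and the function name `read_image_features` and the constants are my own, not part of the dataset.

```python
import array

ID_LENGTH = 10         # characters in the product ID
FEATURE_LENGTH = 4096  # floats stored after each product ID


def read_image_features(path):
    """Yield (product_id, features) pairs from the binary feature file."""
    with open(path, "rb") as f:
        while True:
            product_id = f.read(ID_LENGTH)
            if len(product_id) < ID_LENGTH:
                break  # end of file reached
            # Assumption: 4-byte floats in native byte order.
            features = array.array("f")
            features.fromfile(f, FEATURE_LENGTH)
            yield product_id.decode("ascii"), features
```

Because the function is a generator, the 1.5 million vectors never have to sit in memory at once; you can iterate over the file and keep only the products you need.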

Every product image has one feature vector of size 4096x1, and each entry is a floating-point number.

What do these float numbers mean?

My understanding is that there are 4096 features in total, and that each index of the feature vector corresponds to a specific feature. The value at that index would then indicate how frequently that feature occurs in the given image.

Is that right? If not, what is the correct explanation?

Thanks,

Topic amazon-ml image-recognition image-classification computer-vision feature-extraction

Category Data Science


The same page points to how these features were extracted. Looking more closely at the cited article, "Image-based recommendations on styles and substitutes":

Features are calculated from the original images using the Caffe deep learning framework [11]. In particular, we used a Caffe reference model with 5 convolutional layers followed by 3 fully-connected layers, which has been pre-trained on 1.2 million ImageNet (ILSVRC2010) images. We use the output of FC7, the second fully-connected layer, which results in a feature vector of length F = 4096.

The reference neural network that was mentioned is the BAIR Reference CaffeNet at the Caffe Model Zoo, which is a slightly modified version of AlexNet.

The model was trained on ImageNet, which contains a wide variety of photographs across many categories (1000 of them, if I recall correctly). Retrieving the neural codes of one of its layers (obtained just by forward propagation) will therefore give you visual features that represent the images fairly well, even though the network was not specifically trained for Amazon's tasks (such as product recommendation). What these values actually mean is not very tangible: they are the outcome of multiple 2D convolutions and other normalization and regularization functions, the parameters of which were adjusted specifically for classifying photographs from ImageNet.

The FC7 layer has a rectified linear unit (ReLU) activation, which means that its outputs are all non-negative numbers (potentially with many zeros). And since it is a fully connected layer that follows several convolutions, there is no intuitive mapping between a feature index and a particular characteristic of the image. So the values are not frequencies of specific features. You may picture the network as a highly complex function that yields a high-level representation of the image, in the form of a vector of numbers.
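To make the ReLU point concrete, here is a toy sketch of one fully connected layer followed by ReLU. The sizes, weights and inputs are made up for illustration (the real FC7 layer has 4096 outputs with learned weights), and `fully_connected_relu` is a hypothetical helper, not Caffe code.

```python
def relu(x):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, x)


def fully_connected_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU.

    Each output is a weighted sum over *all* inputs plus a bias,
    which is why no single output maps to one visible image property.
    """
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Toy values, invented for illustration: 3 inputs -> 2 outputs.
inputs = [1.0, -2.0, 0.5]
weights = [[0.2, 0.4, -1.0],
           [0.5, -0.3, 0.8]]
biases = [0.1, 0.0]

outputs = fully_connected_relu(inputs, weights, biases)
print(outputs)  # the first weighted sum is negative, so ReLU clips it to zero
```

Whatever the inputs are, every output is either zero or positive, which matches what you should see in the downloaded vectors.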

See also the paper "Neural Codes for Image Retrieval", where the authors extract features from a pre-trained neural network in this fashion and use them to retrieve images from a different image domain.
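In practice, such vectors are usually compared with each other rather than interpreted entry by entry. Below is a minimal sketch of that idea using cosine similarity; the choice of cosine similarity is my own assumption for illustration and not necessarily the measure used in the paper, and the product IDs and short vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, k=3):
    """Return the k product IDs whose vectors are most similar to `query`."""
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [product_id for product_id, _ in ranked[:k]]


# Hypothetical 3-dimensional stand-ins for the real 4096-dimensional vectors.
catalog = {
    "PRODUCT_A": [1.0, 0.0, 0.0],
    "PRODUCT_B": [0.9, 0.1, 0.0],
    "PRODUCT_C": [0.0, 1.0, 0.0],
}
print(most_similar([1.0, 0.0, 0.0], catalog, k=2))
```

Images whose vectors point in similar directions tend to look alike to the network, even though no individual entry has a readable meaning.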
