How do CNNs use a trained model to find the desired object(s)?

Background: I'm studying CNNs outside of my undergraduate CS course on ML. I have a few questions related to CNNs.

1) When training a CNN, we want tightly bounded/cropped images of the desired classes, correct? E.g., if we were trying to recognize dogs, we would use thousands of images of tightly cropped dogs. We would also feed in images of non-dogs, correct? These images are scaled to a specific size, e.g. 256x256.

2) Let's say training is complete and our model's accuracy seems sufficient, with no problems. From here, take a large, HD image of a non-occluded dog running through a field with various obstacles. With a typical NN, we just take the model, feed it some input, and bam, it outputs some class. How will the CNN view this large image and then 'find' the dog? Do we run some type of preprocessing on the image to partition it, and feed in the partitions? Something like the sketch below?
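To make this concrete, here is a naive sliding-window sketch of the kind of partitioning I have in mind (Python; `model` and `big_image` are hypothetical placeholders, not a real API):

```python
def sliding_windows(image, window=256, stride=128):
    """Yield (x, y, crop) for fixed-size crops covering an H x W x 3 array."""
    h, w = image.shape[:2]
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            yield x, y, image[y:y + window, x:x + window]

# Hypothetical usage: score every crop with the trained classifier.
# for x, y, crop in sliding_windows(big_image):
#     if model.predict(crop) > 0.9:   # hypothetical P(dog) for this crop
#         print(f"possible dog near ({x}, {y})")
```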

Topic convolutional-neural-network computer-vision beginner neural-network

Category Data Science


While I am somewhat hesitant to answer, given that I consider myself a beginner, I think I have something to offer, so I will do my best. I've been working my way up the learning curve for the past year and a half and have built my own feed-forward fully connected and convolutional network solver, so I'm not an absolute beginner.

OK, so here's my input on the question. While it is true that CNNs offer some translation invariance, the issue the OP is facing will not be addressed properly by simply feeding a big image with a dog somewhere in it to a CNN that was trained on closely cropped images. The OP's intuition is correct: there is a preprocessing stage.

This is about the extent of my knowledge, as I am also still learning these techniques, but look up R-CNN (Regions with CNN features) networks. There are various region-proposal techniques; one family is based on segmentation. The image is segmented into smaller sections, and classical computer vision techniques such as HOG (Histogram of Oriented Gradients) are used to make a "weak" estimate of whether a section is a Region of Interest (ROI), i.e. whether it contains an object of interest. Each of these regions is then passed to a trained CNN to determine whether an object it was trained on is present. Apparently, the original R-CNN would pass on average about 2000 ROIs through the CNN to find one object; Fast R-CNN and Faster R-CNN made significant improvements.
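For a feel of how this looks in practice, here is a minimal inference sketch, assuming PyTorch and torchvision (>= 0.13) are installed; the filename and the 0.8 score threshold are placeholders. Faster R-CNN proposes regions and classifies them inside a single network, instead of cropping ~2000 regions and running each through a separate CNN as the original R-CNN did:

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode: no dropout, frozen batch-norm statistics

# Load an image as a float tensor in [0, 1]; the detector handles
# arbitrary image sizes, so the big HD image needs no manual cropping.
img = convert_image_dtype(read_image("dog_in_field.jpg"), torch.float)

with torch.no_grad():
    pred = model([img])[0]  # the model takes a list of images

# Boxes are (x1, y1, x2, y2) in pixels; labels index the COCO categories
# (18 is "dog"). Keep only confident detections.
for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:
        print(label.item(), score.item(), box.tolist())
```

The output is a set of scored boxes, so the network both finds and labels the dog in one forward pass.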


Though this question could be answered in great detail, I will try to explain it in as few words as possible.

1) Cropping the images to a particular framing isn't a strict requirement, and neither is any particular colour format: it doesn't matter much whether a dog is represented in a B&W or an RGB image, because a convolutional network learns features (edges, textures, shapes) rather than raw colour values. Two preprocessing steps are still standard, though: images are usually resized to a fixed size, because most architectures end in fully connected layers that expect fixed input dimensions, and pixel values are separately normalized from 0-255 down to the range 0 to 1.
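As a minimal sketch of that typical preprocessing, assuming torchvision (the 224x224 size is just one common choice):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # fixed spatial size for the network
    transforms.ToTensor(),          # uint8 [0, 255] -> float32 [0.0, 1.0]
])

# tensor = preprocess(pil_image)  # pil_image: any PIL.Image, e.g. from PIL.Image.open(...)
```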

2) Once you have trained your CNN, it has learned the features, such as edges, needed to recognize a dog in an image. Because the model has learned these features, it acquires a degree of translation invariance, meaning that no matter where the dog is positioned in the image, it's still a dog and produces the same features. How does the model recognize it? It checks for the features of a dog learned during training, regardless of where the dog is in the image or what the dog is doing (after the usual resizing to the network's input dimensions).
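Here is a tiny demonstration of that translation property, assuming PyTorch: shifting the input shifts the convolution's response by the same amount, so the learned features 'move with' the dog rather than being lost:

```python
import torch
import torch.nn.functional as F

kernel = torch.randn(1, 1, 3, 3)   # one random 3x3 filter
image = torch.zeros(1, 1, 16, 16)
image[0, 0, 4, 4] = 1.0            # a single bright "feature"

# Shift the feature 5 pixels down and right.
shifted = torch.roll(image, shifts=(5, 5), dims=(2, 3))

out_a = F.conv2d(image, kernel, padding=1)
out_b = F.conv2d(shifted, kernel, padding=1)

# The response to the shifted image is the shifted response: prints True.
print(torch.allclose(torch.roll(out_a, shifts=(5, 5), dims=(2, 3)), out_b))
```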

For an in-depth understanding, you can refer to the following resources:

http://neuralnetworksanddeeplearning.com/chap6.html

http://cs231n.github.io/convolutional-networks/
