Generating image data sets for training CNNs

I want to build a system that recognises (with a given uncertainty) the make and model of a car from an image. I have decided to use Convolutional Neural Networks, specifically the Caffe framework.

My next decision is how best to build my data set. According to this book, I need around 5000 data points for each class (so let's say ~500k images).

I've done a bit of reading here and elsewhere, and it seems that the Google Custom Search API is a potential option; but it limits me to at most 100 searches per day (for free). I thought about building a script to scrape sites like Autotrader, but I have zero experience scraping the web.

Does anyone have experience generating image data sets of this size? Any pearls of wisdom you could share with me? I am happy to invest time and effort learning, for example, Beautiful Soup or this Google API, but I don't want to waste time going down the wrong rabbit hole.

Tags: caffe, neural-network, dataset


There are many different publicly available datasets out there, and most come with a paper describing how the dataset was acquired. Almost nobody takes a camera and starts taking thousands of pictures themselves. You may find some inspiration by looking at those papers and adapting their methods for finding images.

A very popular approach is to download images from Flickr, a photo platform where users share their photos and add comments or tags describing the contents of the images. Flickr also has an API to find and download images.
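As a minimal sketch (assuming you use the requests package and have obtained an API key; YOUR_API_KEY is a placeholder), a search against the Flickr REST API looks roughly like this:

import requests

API_URL = 'https://api.flickr.com/services/rest/'

def search_photos(query, api_key, page=1):
    params = {
        'method': 'flickr.photos.search',
        'api_key': api_key,
        'text': query,
        'extras': 'url_m,license',  # ask for a medium-size URL and the license id
        'per_page': 500,            # Flickr's maximum per request
        'page': page,
        'format': 'json',
        'nojsoncallback': 1,
    }
    return requests.get(API_URL, params=params).json()['photos']['photo']

for photo in search_photos('VW Passat', 'YOUR_API_KEY'):
    if 'url_m' in photo:            # not every photo exposes a medium-size URL
        print(photo['url_m'], photo['license'])

Paging through the results and downloading each url_m gives you the raw image collection.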

A couple of test queries show that there are thousands of photos available:

Query            No of Matches
-------------------------------
VW Passat             57,702
Ford Focus           187,344
Toyota Corolla        81,529
Mitsubishi Lancer    126,242

However, this is not a clean, high-quality dataset: it includes old models, wrongly tagged photos, interior shots, and so on. Still, it might be a good starting point for acquiring huge numbers of images.

Dataset Cleanup

Maybe you can live with a couple of low-quality images, but I guess the better way is to clean up the dataset. There are many possible steps - some may not be needed in your case, and you might need other or additional ones:

  • Remove non-car photos. You probably don't want photos of car interiors, or photos that don't show a car at all. You could, e.g., classify all your images with an ImageNet classifier and discard all images that aren't recognized as "car" (see the first sketch after this list).
  • Use image retrieval algorithms (e.g. SIFT descriptors and matching) to build a graph containing all images and their similarities, as described in [1]. Discard all images that have very little similarity to the rest (or at least review those images manually); the second sketch after this list shows a simple pairwise similarity score.
  • Manual labelling. This is the best way to ensure a really high-quality dataset: have someone go through all the images and make sure they satisfy all your conditions and are labelled correctly. It is very expensive, but will definitely give you the best results. If you don't really need to, don't do it. If you have to, you could rely on Mechanical Turk or similar services.
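To make the first step concrete, here is a minimal sketch using PyTorch/torchvision (not Caffe, but the idea carries over directly); the keyword list is my assumption and should be tuned against the full list of ImageNet category names:

import torch
from PIL import Image
from torchvision import models

# Assumed keyword list for car-like ImageNet classes - tune against the
# actual category names before trusting it.
CAR_KEYWORDS = ('car', 'cab', 'convertible', 'jeep', 'limousine',
                'minivan', 'wagon', 'racer')

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()        # the matching input pipeline
categories = weights.meta['categories']  # the 1000 ImageNet class names

def looks_like_a_car(path):
    img = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        label = categories[model(img).argmax(dim=1).item()]
    return any(kw in label for kw in CAR_KEYWORDS)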
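And a rough sketch of the second step with OpenCV: a pairwise similarity score from SIFT matches and Lowe's ratio test. This is a much simpler stand-in for the learned global descriptors of [1], and the 0.75 ratio threshold is a common default rather than something from the paper:

import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()

def similarity(path_a, path_b):
    # count "good" SIFT matches between two images as a crude similarity score
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, desc_a = sift.detectAndCompute(img_a, None)
    _, desc_b = sift.detectAndCompute(img_b, None)
    if desc_a is None or desc_b is None:
        return 0
    matches = matcher.knnMatch(desc_a, desc_b, k=2)
    # Lowe's ratio test: keep matches clearly better than their runner-up
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    return len(good)

Images whose best score against all other images is very low are candidates for removal or manual review.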

Licensing

Flickr's API description says that:

The Flickr API is available for non-commercial use by outside developers. Commercial use is possible by prior arrangement.

Important note: all photos are the property of their respective owners. All images on Flickr come with specific license conditions, which you can also query through the API (see the sketch below). A list of the available licenses on Flickr is available on their website. You have to make sure you don't infringe the copyright of the respective owners; if your work is commercial, this complicates things considerably.
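For example (again assuming requests and a placeholder API key), you can fetch Flickr's license table once and then filter your downloaded photos by the license id returned alongside each photo:

import requests

resp = requests.get('https://api.flickr.com/services/rest/', params={
    'method': 'flickr.photos.licenses.getInfo',
    'api_key': 'YOUR_API_KEY',
    'format': 'json',
    'nojsoncallback': 1,
}).json()
for lic in resp['licenses']['license']:
    print(lic['id'], lic['name'], lic['url'])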

References

[1]: Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2016). Deep Image Retrieval: Learning Global Representations for Image Search. arXiv:1604.01325.


I decided to turn my comment into an answer.

If you want to go pro, use a framework such as Scrapy.

Personally, I find such frameworks overly cumbersome, and I have had success with the following approach. I think your use case is simple enough for it to work for you as well.

Assuming you are also using Python 3, you can grab a webpage easily and then extract what you want using XPath notation.

from lxml import html
from urllib.parse import urljoin
import urllib.request

# keep fetching pages until there are no "next" pages
for page in range(999):
    url = 'http://blablabla.com/?page=%d' % page
    text = urllib.request.urlopen(url).read()
    tree = html.fromstring(text)
    # image URLs live in the src attribute, not in href
    images = tree.xpath('//img[@class="car"]/@src')
    types = tree.xpath('//div[@class="type"]/text()')
    if not images:
        break
    for i, (cartype, image) in enumerate(zip(types, images)):
        # resolve relative URLs against the page URL before downloading
        image = urljoin(url, image)
        urllib.request.urlretrieve(image, '%s-page%d-img%d.png' % (cartype, page, i))

(Purely illustrative example.)

Now adjust as needed. XPath is an incredibly powerful notation for accessing XML nodes; much more is possible than what I show here. Take this tutorial to learn the full XPath syntax.

Some web designers make it much harder to access what you want because they do not give their HTML elements meaningful class attributes. In those cases, you may have to access a parent node and ask for its children, or access a sibling and navigate from there (see the example below). Anyway, XPath and Python's lxml package make all of this incredibly easy.
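For instance, with hypothetical markup where the model name sits in an unclassed <span> next to a labelled <div>, you can hop over via a sibling axis:

from lxml import html

tree = html.fromstring(
    '<div class="info"><div>Model:</div><span>Ford Focus</span></div>')
# start at the <div> whose text is "Model:", then take the following <span>
print(tree.xpath('//div[text()="Model:"]/following-sibling::span/text()'))
# prints: ['Ford Focus']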

Any modern browser, such as Chrome or Firefox, also lets you easily explore the DOM of any webpage: just right-click and press Inspect, or open Developer Tools from the Tools menu.

Note: some websites, like scholar.google.com, disallow scrapers and are very good at detecting when that's what you're doing. You can specify a user-agent for urllib (see below), but it might be futile; even advanced frameworks may not be able to help you there.
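If you want to try it anyway, urllib lets you set the header via a Request object (the user-agent string below is purely illustrative):

import urllib.request

# send a browser-like User-Agent header with the request
req = urllib.request.Request(
    'http://blablabla.com/?page=0',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})
text = urllib.request.urlopen(req).read()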

EDIT: I have written a blog post where I elaborate a little more.


Have you looked at the Stanford Cars Dataset? It has about 16k images covering 196 classes of cars. While it does not have the number of images you are looking for, it does seem sufficient for building a classifier (see the references below).

This blog post by Justin Chien provides a good overview of the approach to building a classifier on this data using a CNN, and this paper also provides an overview of a few different approaches.
