General equation for getting an idea of the scale of a machine learning project

I'm writing an application for a project where we intend to train a model to predict one aspect of an environment (traffic safety) from a database with 10 images (about 300×300 px and, say, 256 colors) for each of either 100,000 or 15 million locations.

I need to come to grips with whether both, one, or neither of these projects is feasible given our hardware constraints. What can I expect? Is there a formula or benchmark one can refer to? Will this be doable on a laptop with a decent GPU or on a dedicated ML workstation, or does it require the level of infrastructure that Google and Amazon use?

Topic: hardware, project-planning

Category: Data Science


Interesting but difficult question! It depends on the efficiency of your algorithm, both for training and for scoring/prediction, but to get a first idea I would go by the amount of data we're talking about.

256 colors means 8 bits, i.e. 1 byte, per pixel, times 300×300 pixels. Uncompressed, that is 90 kB per image. With 10 images per location you get 0.9 MB of data per location; 100,000 locations then give 90 GB of uncompressed input data, and 15 million locations give 13.5 TB. If you compress the data, i.e. store the images as JPEG or similar, I would expect you need roughly a factor of 10 less storage (it depends on how easily the images can be compressed and how well JPEG compression works on them).
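As a quick sanity check of these numbers, here is a minimal back-of-envelope sketch; the factor-10 JPEG compression ratio is an assumption, not a measurement:

```python
def uncompressed_size_bytes(width, height, bits_per_pixel, images_per_location, locations):
    """Rough uncompressed storage estimate; ignores any file-format overhead."""
    bytes_per_image = width * height * bits_per_pixel / 8
    return bytes_per_image * images_per_location * locations

for n_locations in (100_000, 15_000_000):
    total = uncompressed_size_bytes(300, 300, 8, 10, n_locations)
    # Assumed ~10x compression for JPEG; real ratios vary with image content.
    print(f"{n_locations:>10,} locations: {total / 1e9:,.0f} GB raw, "
          f"~{total / 10 / 1e9:,.0f} GB as JPEG")
```

This prints 90 GB raw (about 9 GB compressed) for the 100,000-location case and 13,500 GB, i.e. 13.5 TB, raw (about 1.35 TB compressed) for the 15-million-location case.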

Given unlimited time (and storage), any amount of data can be crunched on a laptop, although my laptop doesn't have 13 TB of storage. Even the 90 GB of uncompressed data in the smaller case no longer fits in RAM, so I would expect training on a single computer to be a struggle unless you have an exceptionally efficient algorithm: one specifically designed so that it can be trained on all the data on a single machine by streaming it from disk, not the multi-layer CNN I expect you have in mind.
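To make that a bit more concrete, here is a rough lower bound on how long a single pass (epoch) over the raw data would take if you are limited only by disk reads; the 500 MB/s SSD throughput is an assumption, and in practice the model itself may well be the bottleneck instead:

```python
# Lower bound on one full pass over the raw data, limited by disk read speed.
# 500 MB/s is an assumed local-SSD throughput; network storage or spinning
# disks are slower, and training usually needs many such passes.
DISK_MB_PER_S = 500

for label, size_gb in (("100k locations", 90), ("15M locations", 13_500)):
    seconds = size_gb * 1000 / DISK_MB_PER_S
    print(f"{label}: at least {seconds / 3600:.1f} h per pass over the raw data")
```

Under these assumptions one pass takes a few minutes for the 100,000-location case, but around 7.5 hours for the 15-million-location case, before any computation has been done.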

Using distributed computing, in the cloud or on-premise, has its own costs too, in terms of infrastructure, complexity of debugging, and so on. But I would expect that with a dataset of this size, especially in the 15-million-location case, it's worth it.

That's something different from "the level of infrastructure that Google uses", which is millions of servers. I would start processing with, for example, 100 cores, and see how it goes. Don't forget to turn them off when you're done!
