Methodology for parallelising linked data?

If I have data where every element can be linked to every other element in the set, but I want to split that data out for parallel processing, either to reduce computation time or to shrink the size of any single piece being worked on, is there a methodology for chunking it without compromising the validity of the results?

For example, assume I have a grid of crime across the whole of a country. I wish to treat this grid as a heat map of crime and therefore "smear" the heat from a crime to nearby points.

However, the time to calculate this is too long if I try to do it across the whole of the country. Therefore, I want to split this grid out into manageable chunks.

But if I do this, then high-crime areas on the edges of the chunks will not be "smeared" into neighbouring chunks. I don't want to lose that validity.

What, if any, is the methodology to solve this linking of data in parallelisation?

Topic: methodology, optimization, parallel

Category: Data Science


I'd look into applying some kind of sliding window to your data so you can process one subset at a time. By adjusting how much the window overlaps with the previous chunks, you can control how much context each batch has.
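
Here is a minimal sketch of that idea in Python, assuming the grid is a 2D NumPy array of counts and the "smear" is a Gaussian blur (SciPy's gaussian_filter is used as a stand-in for whatever smear you actually apply). The names smear_in_chunks, chunk_rows and the sigma value are illustrative, not prescribed. Each chunk is extended by a halo of rows wider than the smear radius, processed independently, and then the halo is cropped off, so chunk boundaries never show up in the output.

    # Chunk a 2D grid with an overlapping halo, smooth each chunk
    # independently, then crop the halo so edge effects never reach the output.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def smear_in_chunks(grid, chunk_rows=256, sigma=3.0):
        # Halo wide enough that the truncated Gaussian kernel fits inside it.
        halo = int(4 * sigma) + 1
        out = np.empty_like(grid, dtype=float)
        for start in range(0, grid.shape[0], chunk_rows):
            stop = min(start + chunk_rows, grid.shape[0])
            # Extend the slice by the halo on both sides (clipped at the borders).
            lo = max(start - halo, 0)
            hi = min(stop + halo, grid.shape[0])
            smoothed = gaussian_filter(grid[lo:hi].astype(float), sigma=sigma)
            # Keep only the interior rows; the halo rows are discarded.
            out[start:stop] = smoothed[start - lo : start - lo + (stop - start)]
        return out

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        crime = rng.poisson(0.05, size=(1000, 1000)).astype(float)
        chunked = smear_in_chunks(crime)
        full = gaussian_filter(crime, sigma=3.0)
        print(np.allclose(chunked, full))  # should print True: halo removes edge effects

Because each chunk only depends on its own rows plus the halo, the loop body can be handed to a process pool or separate machines; this pattern is usually called halo or ghost-cell exchange in domain decomposition.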

The implementation details will depend on what kind of data structure you use, but since you mentioned a grid in which you want to smear the data, a convolution + pooling approach might do the trick. It's very easy to run in parallel and can be extremely efficient on modern GPUs.
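
As a sketch of what convolution + pooling could look like, assuming the same 2D count grid as above: the smear is expressed as a convolution with a Gaussian kernel, and the result is coarsened with average pooling. The functions gaussian_kernel and smear_and_pool, the kernel radius and the pool factor are all illustrative; the SciPy call runs on CPU, but the same two operations map directly onto GPU array libraries.

    # Express the "smear" as a 2D convolution with a Gaussian kernel,
    # then coarsen the result with average pooling.
    import numpy as np
    from scipy.signal import fftconvolve

    def gaussian_kernel(sigma=3.0, radius=None):
        radius = radius or int(4 * sigma)
        ax = np.arange(-radius, radius + 1)
        xx, yy = np.meshgrid(ax, ax)
        k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
        return k / k.sum()

    def smear_and_pool(grid, sigma=3.0, pool=4):
        # Convolution spreads each crime count over its neighbourhood.
        heat = fftconvolve(grid.astype(float), gaussian_kernel(sigma), mode="same")
        # Average pooling: collapse each pool x pool block to its mean.
        rows = (heat.shape[0] // pool) * pool
        cols = (heat.shape[1] // pool) * pool
        blocks = heat[:rows, :cols].reshape(rows // pool, pool, cols // pool, pool)
        return blocks.mean(axis=(1, 3))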

Also, your example suggests that the result of your chunk processing would be some kind of subsampling. If the data is too large, you could subsample it first. There are many ways to do that, depending on which criteria are acceptable for discarding data points, so you'll have to dig in and figure out which method is best for your use case.
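
For instance, one possible subsampling criterion is to aggregate the counts into coarser cells before smearing, trading spatial resolution for speed. This is only a sketch under that assumption; the factor of 10 and the sigma values are illustrative.

    # Aggregate fine cells into coarse cells, then smear the smaller grid.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def coarsen(grid, factor=10):
        rows = (grid.shape[0] // factor) * factor
        cols = (grid.shape[1] // factor) * factor
        blocks = grid[:rows, :cols].reshape(rows // factor, factor,
                                            cols // factor, factor)
        return blocks.sum(axis=(1, 3))  # total crime per coarse cell

    def coarse_heatmap(grid, factor=10, sigma_cells=30.0):
        # A smear of sigma_cells fine cells becomes sigma_cells / factor coarse cells,
        # so the smear still covers the same physical distance.
        coarse = coarsen(grid, factor).astype(float)
        return gaussian_filter(coarse, sigma=sigma_cells / factor)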
