Dividing a dataset to parallelize machine learning training on the cloud

I'm very new to machine learning. I'm doing a project for a course called Parallel and Distributed Computing, in which we have to speed up a heavy computation using parallelism or distributed computing. My idea was to divide a dataset into equal parts and train a separate neural network on each subset, each on its own machine in the cloud. Once trained, the models would be sent back to me and somehow combined into a single model. I'm aware of federated learning, but it doesn't quite fit my scenario of actually dividing the dataset and sending the parts into the cloud. Does anyone know a feasible approach (maybe a variant of federated learning) for doing this?

Topic federated-learning cloud machine-learning

Category Data Science


There are many ways to parallelize machine learning. It is often better to distribute the model parameters (or their gradients), not the data.
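To make "distribute the parameters, not the data" concrete, here is a minimal NumPy sketch of synchronous data-parallel training on a linear model. It is an illustrative toy, not a production setup: each "worker" holds one shard of the data, and only the gradient vector (the same size as the parameter vector) is exchanged and averaged each step, never the shards themselves.

```python
import numpy as np

# Toy dataset: linear regression with known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

# Split the data into equal shards, one per simulated worker.
n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

def shard_gradient(w, Xs, ys):
    """Mean-squared-error gradient computed locally on one shard."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    # In a real cluster each gradient would be computed on a separate
    # machine and combined with an all-reduce; this loop simulates that.
    grads = [shard_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= lr * np.mean(grads, axis=0)

print(w)  # should end up close to true_w
```

Averaging per-shard gradients every step is mathematically equivalent to computing the full-batch gradient, so all workers stay in sync on one shared model; this is the same pattern frameworks like PyTorch's `DistributedDataParallel` implement at scale.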

Training a model on only a fixed subset of the data will generally yield worse parameter estimates than training a model on random samples drawn from the whole dataset.

Additionally, moving data around is usually far more expensive than moving parameters.
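That said, the asker's plan (train one model per shard independently, then combine) does work for simple models and is sometimes called one-shot parameter averaging. Here is a hypothetical NumPy sketch under the same linear-regression setup as above; for deep networks, naive weight averaging is much less reliable, which is why schemes like federated averaging re-synchronize periodically.

```python
import numpy as np

# Toy dataset: linear regression with known true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
true_w = np.array([1.5, -2.0, 0.75])
y = X @ true_w + 0.01 * rng.normal(size=2000)

def train_on_shard(Xs, ys, lr=0.1, steps=300):
    """Gradient descent run entirely on one shard, as if on one machine."""
    w = np.zeros(Xs.shape[1])
    for _ in range(steps):
        w -= lr * 2 * Xs.T @ (Xs @ w - ys) / len(ys)
    return w

# Each shard is trained completely independently (no communication).
n_workers = 4
local_models = [
    train_on_shard(Xs, ys)
    for Xs, ys in zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
]

# Combine: simple average of the parameter vectors,
# weighted equally because the shards are equal-sized.
w_combined = np.mean(local_models, axis=0)
print(w_combined)  # close to true_w for this well-behaved problem
```

An alternative to averaging weights is to keep all the local models and average their *predictions* (an ensemble), which avoids the weight-alignment problem for neural networks at the cost of running every model at inference time.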
