Alternative to EC2 for running ML batch training jobs on AWS

Question

Cybernetic

June 29, 2021, 15:07

We are building an ML pipeline on AWS, which will obviously require some heavy-compute components including preprocessing and batch training.

Most of the pipeline is on Lambda, but Lambda limits how long a single invocation can run (about 15 minutes). So for longer-running jobs like batch training of ML models, we will need(?) something like EC2 instances. For example, a Lambda function could be invoked and then kick off an EC2 instance to handle the training.

Are there any alternatives to using EC2 for the heavy compute? Is there a way to still host/run the job on AWS without leveraging any EC2 to do the compute?

The idea is to avoid the extra management that comes with EC2, since we’re not currently using it. Keeping everything as close to Lambda-like as possible is ideal.

Topic data-engineering aws-lambda pipelines aws machine-learning

Category Data Science

prashant0598 · Accepted Answer · June 29, 2021, 15:07

For batch training I have been using SageMaker. It's somewhat more expensive than EC2, but it's easy to set up and get started with.

Build a Docker container and push it to ECR, then start the training job and track the metrics with any monitoring tool, such as wandb.
If your use case doesn't require any custom packages, you can also use the Hugging Face DLC (Deep Learning Container), which makes it even easier to start training.
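To tie this back to the Lambda-based pipeline in the question, here is a rough sketch of a Lambda handler that starts a SageMaker training job from a container image already pushed to ECR. It is only an illustration under some assumptions: the event keys (`image_uri`, `role_arn`, `output_s3_uri`), the job name prefix, the instance type, the volume size, and the runtime cap are all placeholders I made up, and the IAM role is assumed to already have the permissions SageMaker needs. The boto3 call used is `create_training_job` on the `sagemaker` client.

```python
import time


def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.m5.xlarge",
                               max_runtime_seconds=6 * 60 * 60):
    """Build the keyword arguments for SageMaker's create_training_job call.

    All defaults here are illustrative assumptions, not recommendations.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # Custom training image previously pushed to ECR.
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Model artifacts written by the container are uploaded here.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # The job is stopped if it runs longer than this; it is not bound
        # by the Lambda timeout because it runs on SageMaker, not in Lambda.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


def lambda_handler(event, context):
    """Hypothetical Lambda entry point: submit the job and return at once."""
    import boto3  # imported here so the builder above has no AWS dependency

    request = build_training_job_request(
        job_name=f"batch-train-{int(time.time())}",  # hypothetical naming scheme
        image_uri=event["image_uri"],                # assumed event key
        role_arn=event["role_arn"],                  # assumed event key
        output_s3_uri=event["output_s3_uri"],        # assumed event key
    )
    response = boto3.client("sagemaker").create_training_job(**request)
    return {"training_job_arn": response["TrainingJobArn"]}
```

The Lambda function only submits the job and returns, so it stays well inside its own time limit while SageMaker provisions and tears down the training instance for you.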
