Why does my first Amazon SageMaker AutoGluon training job fail and keep failing?

I followed the instructions from this article about creating a code-free machine learning pipeline. I already had a working offline pipeline on the same data using TPOT (AutoML), so I uploaded my data to AWS to try their AutoML offering.

I did the exact steps described in the article and uploaded my _train and _test CSV files, each with a column named 'target' containing the target value.

The following error message was returned as the failure reason:

```
AlgorithmError: ExecuteUserScriptError: Command
/usr/local/bin/python3.6 autogluon-tab-with-test.py --filename \
CENSORED_train.csv --s3-output s3://code-free-automl-eu-west-1-
price-prediction/results/
```
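`ExecuteUserScriptError` only reports that the entry-point script exited non-zero; the actual Python traceback ends up in the job's CloudWatch logs (log group `/aws/sagemaker/TrainingJobs`), which is the first place to look. Before rerunning, one cheap thing to rule out locally is a header mismatch between the two files. The helper below is my own sketch (the AWS script's exact expectations may differ) that checks both CSVs expose the expected `target` column and share the same header:

```python
import csv
import io

def check_headers(train_csv: str, test_csv: str, target: str = "target"):
    """Return a list of problems found in the two CSV headers (empty = OK)."""
    problems = []
    headers = {}
    for name, text in (("train", train_csv), ("test", test_csv)):
        reader = csv.reader(io.StringIO(text))
        # First row of each file is assumed to be the header.
        headers[name] = next(reader, [])
        if target not in headers[name]:
            problems.append(f"{name} file has no '{target}' column")
    if headers["train"] != headers["test"]:
        problems.append("train and test headers differ")
    return problems

# Tiny in-memory stand-ins for the real CENSORED_train.csv / CENSORED_test.csv
train = "feat_a,feat_b,target\n1,0,10\n"
test_ok = "feat_a,feat_b,target\n0,1,12\n"
test_bad = "feat_a,feat_b,label\n0,1,12\n"

print(check_headers(train, test_ok))   # []
print(check_headers(train, test_bad))  # both problems reported
```

In a real run you would read the two uploaded files with `open(...)` instead of the in-memory strings; anything this check flags would make the training script crash long before SageMaker reports more detail.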

The status history looks like the following:

**Status history**

| Status | Start time | End time | Description |
| --- | --- | --- | --- |
| Starting | Jun 28, 2021 16:37 UTC | Jun 28, 2021 16:39 UTC | Preparing the instances for training |
| Downloading | Jun 28, 2021 16:39 UTC | Jun 28, 2021 16:39 UTC | Downloading input data |
| Training | Jun 28, 2021 16:39 UTC | Jun 28, 2021 16:42 UTC | Training image download completed. Training in progress. |
| Uploading | Jun 28, 2021 16:42 UTC | Jun 28, 2021 16:42 UTC | Uploading generated training model |
| Failed | Jun 28, 2021 16:42 UTC | Jun 28, 2021 16:42 UTC | Training job failed |

The training image, as provided by AWS:

```
763104351884.dkr.ecr.eu-west-1.amazonaws.com/mxnet-training:1.6.0-cpu-py3
```

Data: my dataset has about 8k rows and 41 columns, of which about 30 come from one-hot encoded variables.
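For context on what the training script sees, here is a sketch (column names invented, tiny synthetic table standing in for the real 8k x 41 one) that flags which columns are strictly 0/1, a quick way to confirm the one-hot encoding survived the CSV round trip:

```python
import csv
import io

# Synthetic stand-in: one numeric column, a few one-hot columns, one target.
csv_text = (
    "price,cat_red,cat_blue,cat_green,target\n"
    "9.5,1,0,0,100\n"
    "3.2,0,1,0,80\n"
    "7.1,0,0,1,95\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
n_rows, n_cols = len(rows), len(rows[0])

# A column counts as one-hot if every value is the literal string "0" or "1".
one_hot = [
    col for col in rows[0]
    if all(r[col] in ("0", "1") for r in rows)
]

print(n_rows, n_cols)     # 3 5
print(sorted(one_hot))    # ['cat_blue', 'cat_green', 'cat_red']
```

As far as I know, AutoGluon's `TabularPredictor` handles raw categorical columns on its own, so the one-hot expansion may not even be necessary and would also shrink the table back toward its original width.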

Topics: automl, sagemaker, cloud-computing, aws

Category: Data Science
