Why does my first Amazon SageMaker AutoGluon training job fail and keep failing?
I followed the instructions from this article about creating a code-free machine learning pipeline. I already had a working offline pipeline on the same data using TPOT (AutoML), so I uploaded my data to AWS to try SageMaker's AutoML workflow.
I followed the steps described in the article exactly and uploaded my _train and _test CSV files, each containing a column named 'target' that holds the target value.
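Before uploading, I sanity-checked both files locally along these lines (a minimal sketch; the file names below are placeholders for my censored ones):

```python
import pandas as pd

# Hypothetical local copies of the uploaded files (real names are censored)
train = pd.read_csv("CENSORED_train.csv")
test = pd.read_csv("CENSORED_test.csv")

# Both files must contain the label column the training script expects
for name, df in [("train", train), ("test", test)]:
    assert "target" in df.columns, f"{name} file is missing the 'target' column"
    print(name, df.shape, df["target"].dtype)
```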
The following error message was returned as a failure reason:
```
AlgorithmError: ExecuteUserScriptError: Command
/usr/local/bin/python3.6 autogluon-tab-with-test.py --filename \
    CENSORED_train.csv --s3-output s3://code-free-automl-eu-west-1-price-prediction/results/
```
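The console truncates the failure reason at this point; the full traceback should be in CloudWatch. A minimal sketch of fetching it with boto3 (the job name is a placeholder; `/aws/sagemaker/TrainingJobs` is the default log group for training jobs):

```python
import boto3

logs = boto3.client("logs", region_name="eu-west-1")
group = "/aws/sagemaker/TrainingJobs"  # default log group for training jobs
job_name = "MY-TRAINING-JOB-NAME"      # placeholder for the failed job's name

# Each training job writes one or more log streams prefixed with its name
streams = logs.describe_log_streams(
    logGroupName=group, logStreamNamePrefix=job_name
)["logStreams"]
for stream in streams:
    events = logs.get_log_events(
        logGroupName=group, logStreamName=stream["logStreamName"]
    )["events"]
    for event in events:
        print(event["message"])
```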
The status history looks like the following:
**Status history**

| Status | Start time | End time | Description |
| --- | --- | --- | --- |
| Starting | Jun 28, 2021 16:37 UTC | Jun 28, 2021 16:39 UTC | Preparing the instances for training |
| Downloading | Jun 28, 2021 16:39 UTC | Jun 28, 2021 16:39 UTC | Downloading input data |
| Training | Jun 28, 2021 16:39 UTC | Jun 28, 2021 16:42 UTC | Training image download completed. Training in progress. |
| Uploading | Jun 28, 2021 16:42 UTC | Jun 28, 2021 16:42 UTC | Uploading generated training model |
| Failed | Jun 28, 2021 16:42 UTC | Jun 28, 2021 16:42 UTC | Training job failed |
The training image provided by AWS:

```
763104351884.dkr.ecr.eu-west-1.amazonaws.com/mxnet-training:1.6.0-cpu-py3
```
Data: the dataset has about 8k rows and 41 columns, of which about 30 come from one-hot encoded variables.
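To rule out a data problem, I could also try a quick local fit with AutoGluon on the same file (a sketch assuming a recent `pip install autogluon`; the API of the AutoGluon version baked into the 2021 MXNet image may differ):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Hypothetical local copy of the censored training file
train = TabularDataset("CENSORED_train.csv")

# Short, time-limited fit just to check whether AutoGluon accepts the data
predictor = TabularPredictor(label="target").fit(train, time_limit=120)
print(predictor.leaderboard())
```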
Tags: automl, sagemaker, cloud-computing, aws
Category: Data Science