Why does my first Amazon SageMaker AutoGluon training job fail and keep failing?
I followed the instructions from this article about creating a code-free machine learning pipeline. I already had a working offline pipeline on the same data using TPOT (AutoML), so I uploaded my data to AWS to try SageMaker's AutoML workflow.
I followed the steps described in the article exactly and uploaded my _train and _test CSV files, each containing a column named 'target' that holds the target value.
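Before uploading, I sanity-checked both files locally along these lines (a minimal sketch; the file names below are placeholders for my censored ones):

```python
import pandas as pd

# Hypothetical local copies of the uploaded files (real names are censored)
train = pd.read_csv("CENSORED_train.csv")
test = pd.read_csv("CENSORED_test.csv")

# Both files must contain the label column the training script expects
for name, df in [("train", train), ("test", test)]:
    assert "target" in df.columns, f"{name} file is missing the 'target' column"
    print(name, df.shape, df["target"].dtype)
```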
The following error message was returned as a failure reason:
```
AlgorithmError: ExecuteUserScriptError: Command
/usr/local/bin/python3.6 autogluon-tab-with-test.py --filename \
    CENSORED_train.csv --s3-output s3://code-free-automl-eu-west-1-price-prediction/results/
```
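The console truncates the failure reason at this point; the full traceback should be in CloudWatch. A minimal sketch of fetching it with boto3 (the job name is a placeholder; `/aws/sagemaker/TrainingJobs` is the default log group for training jobs):

```python
import boto3

logs = boto3.client("logs", region_name="eu-west-1")
group = "/aws/sagemaker/TrainingJobs"  # default log group for training jobs
job_name = "MY-TRAINING-JOB-NAME"      # placeholder for the failed job's name

# Each training job writes one or more log streams prefixed with its name
streams = logs.describe_log_streams(
    logGroupName=group, logStreamNamePrefix=job_name
)["logStreams"]
for stream in streams:
    events = logs.get_log_events(
        logGroupName=group, logStreamName=stream["logStreamName"]
    )["events"]
    for event in events:
        print(event["message"])
```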
The status history looks like the following:
**Status history**

| Status | Start time | End time | Description |
| --- | --- | --- | --- |
| Starting | Jun 28, 2021 16:37 UTC | Jun 28, 2021 16:39 UTC | Preparing the instances for training |
| Downloading | Jun 28, 2021 16:39 UTC | Jun 28, 2021 16:39 UTC | Downloading input data |
| Training | Jun 28, 2021 16:39 UTC | Jun 28, 2021 16:42 UTC | Training image download completed. Training in progress. |
| Uploading | Jun 28, 2021 16:42 UTC | Jun 28, 2021 16:42 UTC | Uploading generated training model |
| Failed | Jun 28, 2021 16:42 UTC | Jun 28, 2021 16:42 UTC | Training job failed |
The training image provided by AWS:

```
763104351884.dkr.ecr.eu-west-1.amazonaws.com/mxnet-training:1.6.0-cpu-py3
```
Data: the dataset has about 8k rows and 41 columns, of which about 30 come from one-hot encoded variables.
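To rule out a data problem, I could also try a quick local fit with AutoGluon on the same file (a sketch assuming a recent `pip install autogluon`; the API of the AutoGluon version baked into the 2021 MXNet image may differ):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Hypothetical local copy of the censored training file
train = TabularDataset("CENSORED_train.csv")

# Short, time-limited fit just to check whether AutoGluon accepts the data
predictor = TabularPredictor(label="target").fit(train, time_limit=120)
print(predictor.leaderboard())
```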
Tags: automl, sagemaker, cloud-computing, aws
Category: Data Science