How to fit a model on validation_data?

Question

How to fit a model on validation_data?

warriorforce

June 2, 2022 07:01

Can you help me understand this better? I need to detect anomalies, so I am trying to fit an LSTM model using validation_data, but the losses do not converge. Do they really need to converge? Should the validation data resemble the train data, the test data, or something in between? Also, which value should be lower, loss or val_loss? Thank you!

Topic lstm keras anomaly-detection regression

Category Data Science

Theudbald · Accepted Answer · June 2, 2022 07:01

When validating machine learning models, you have to use a validation procedure that is consistent with your problem. For an anomaly detection use case, that means splitting your data correctly and evaluating your model with the right metrics.

Split of the data

You have to choose carefully how you split your data. By default, define three different sets: training, validation and test.

A random train-validation-test split is the most appropriate choice when the observations are independent and time plays no role in your problem. It works well there because it keeps the distribution of the training data similar to that of the validation and test sets.

Example 1: To detect anomalies in banking transactions, the observations are independent and time is not important. A train-validation-test split seems to be an appropriate choice.
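As a rough sketch of such a split for independent observations, the snippet below calls sklearn's train_test_split twice to get a 60/20/20 partition. The data, the 2% anomaly rate, the split ratios and the use of stratify are all assumptions made for illustration, not something stated in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 independent transactions, 2% labelled as anomalies.
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# First carve out 20% as the test set, then 25% of the remainder
# (20% of the total) as the validation set.
# stratify keeps the anomaly rate similar in every subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(f"{name}: {len(labels)} rows, {labels.sum()} anomalies")
```

Stratifying is a design choice here: with so few anomalies, an unstratified random split could leave the validation or test set with almost none.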

Example 2: To detect anomalous temperatures in a time series, time is an important variable: with a random split, the model could learn about these anomalous temperatures from future data, which would introduce a look-ahead bias. In that situation, refer to sklearn TimeSeriesSplit. If you have few observations, you can also take a look at cross-validation.
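A minimal sketch of how TimeSeriesSplit keeps the validation data strictly after the training data. The 20-point series and the choice of 4 splits are hypothetical values picked only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 20 temperature readings in chronological order,
# shaped (n_samples, n_features).
temperatures = np.arange(20, dtype=float).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(temperatures))

for fold, (train_idx, val_idx) in enumerate(folds):
    # Every training index precedes every validation index,
    # so the model never sees the future while it is being fitted.
    print(
        f"fold {fold}: train 0..{train_idx[-1]}, "
        f"validate {val_idx[0]}..{val_idx[-1]}"
    )
```

Each fold trains on a growing prefix of the series and validates on the block that follows it, which is what removes the look-ahead bias.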

Because you are using LSTM models, which are designed for sequence and time series modeling, I guess you are in the second situation.

Which loss to minimize?

For model selection, the quantity to minimize is always the validation loss. A sound model selection procedure looks like this:

1. Select a set of models and features to optimize.
2. For each model:
   - Train the model on the train set.
   - Evaluate the model on the validation set.
3. Select your best model according to your validation set metrics.
4. Evaluate it once, and only once, on the test set.
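The procedure above can be sketched in a few lines. The example below is only an illustration: instead of an LSTM it uses a hypothetical one-coefficient lag model (series[t] is predicted from series[t - lag]), a synthetic noisy series, and a chronological 60/20/20 split. All of these are assumptions chosen so the snippet runs without any deep learning library.

```python
import math
import random

def fit_lag_model(series, lag):
    """Least-squares fit of series[t] ~ a * series[t - lag] + b."""
    xs, ys = series[:-lag], series[lag:]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(series, lag, params, start, stop):
    """Mean squared error of the lag model on series[start:stop]."""
    a, b = params
    errors = [(series[t] - (a * series[t - lag] + b)) ** 2 for t in range(start, stop)]
    return sum(errors) / len(errors)

# Hypothetical series: a noisy 24-step cycle, kept in chronological order.
rng = random.Random(0)
series = [20 + 5 * math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.5) for t in range(240)]

# Chronological split: first 60% train, next 20% validation, last 20% test.
train_end, val_end = 144, 192

fitted, val_losses = {}, {}
for lag in (1, 12, 24):  # the set of candidate models
    fitted[lag] = fit_lag_model(series[:train_end], lag)  # train on the train set only
    val_losses[lag] = mse(series, lag, fitted[lag], train_end, val_end)

# Select on validation loss, then touch the test set exactly once.
best_lag = min(val_losses, key=val_losses.get)
test_loss = mse(series, best_lag, fitted[best_lag], val_end, len(series))
print(f"best lag: {best_lag}, validation losses: {val_losses}, test loss: {test_loss:.3f}")
```

The test loss is computed only for the selected model, so it stays an unbiased estimate of how that model behaves on unseen data.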

Since the goal is a low loss on the validation set, you don't especially need the loss on the training set to converge. In an overfitting situation, for example, you can obtain a very low loss on the training set but a very high loss on the validation set. The training loss is usually the lower of the two; what matters is how low the validation loss gets. Test set metrics are your true compass.
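To make the gap between the two losses concrete, here is a small sketch that fits polynomials of increasing degree to noisy data and prints both losses. Polynomial regression stands in for the LSTM purely for illustration; the data, noise level and degrees are assumptions. The same reading applies to the loss and val_loss values Keras reports when validation_data is passed to fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples of a smooth curve, with separate validation points.
x_train = np.linspace(-1, 1, 15)
x_val = np.linspace(-0.95, 0.95, 15)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, x_train.size)
y_val = np.sin(3 * x_val) + rng.normal(0, 0.2, x_val.size)

def losses(degree):
    """Fit on the training points only and return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_loss = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_loss = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_loss, val_loss

results = {degree: losses(degree) for degree in (1, 3, 12)}
for degree, (train_loss, val_loss) in results.items():
    print(f"degree {degree:2d}: train loss {train_loss:.4f}, val loss {val_loss:.4f}")
```

A more flexible model can only lower the training loss, whereas nothing forces the validation loss to follow, which is why the validation loss is the one to watch.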

Which metrics to use?

For an anomaly detection use-case, you have to carefully choose your metrics.

For most use cases, accuracy is a very poor metric: the distribution of your labels is imbalanced, and the positive class (the anomalies) is normally more important than the non-anomaly class.

You have to select a metric that accounts for both of these points and for the problem you are trying to solve.
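A small worked example of why accuracy misleads on imbalanced labels. Precision, recall and F1 are used here as examples of metrics that focus on the anomaly class; they are one possible choice, not one named in the answer, and the label counts are made up.

```python
def scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 990 normal observations followed by 10 anomalies.
y_true = [0] * 990 + [1] * 10

# A useless model that never raises an alert.
always_normal = [0] * 1000
# A detector that catches 8 of the 10 anomalies at the cost of 20 false alarms.
detector = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

baseline_scores = scores(y_true, always_normal)
detector_scores = scores(y_true, detector)
print("always normal:", baseline_scores)
print("detector:     ", detector_scores)
```

Accuracy ranks the model that never detects anything above the detector, while recall and F1 rank them the other way round.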
