Very low accuracy on new data compared to validation data

I'm trying to train a neural network to predict the movement of a particular security on the market.

I train on a year of historical data. The network's input is candlestick data: close price and volume.

Before being fed in, the data are normalized separately for each dataset using the z-score algorithm. This immediately raises a question: the output is not bounded to [0; 1] or [-1; 1] and can reach 10 or more in either direction. Is that okay?
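To illustrate the question, here is a minimal z-score sketch (using NumPy; the data values are made up). Z-scores are not bounded: for roughly Gaussian data almost all values fall within about ±3, but heavy-tailed data such as price series can easily produce much larger magnitudes, so values outside [-1; 1] are expected and not an error by themselves.

```python
import numpy as np

def z_score(x):
    # Standardize: subtract the mean, divide by the standard deviation.
    return (x - x.mean()) / x.std()

# Hypothetical price series with one extreme value.
prices = np.array([100.0, 101.0, 100.5, 102.0, 250.0])
z = z_score(prices)
# The standardized values have mean 0, but the outlier's z-score
# lands well outside [-1, 1].
```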

All samples are first shuffled and then split into two parts, train and test (80:20). The output is a class: the price will go up or down.

On train, the error drops to zero and the accuracy reaches 100%. On test, the error bottoms out at 0.28 and the accuracy reaches 90%.

I then tried testing this neural network on the next two months of data, which were never used in training. They went through exactly the same normalization. However, the accuracy of this forecast is 0.5289256204750912, and the error is 1.8223290671986982.

I know I'm doing something wrong, but I don't know what... I hope someone can help me figure this out. PS: I also tried plain min-max normalization instead of z-score, but that brought no particular improvement either.

Topic: market-basket-analysis, accuracy, classification, machine-learning

Category: Data Science


You are experiencing data leakage. In a comment, you explained that you shuffle your data before splitting into train/validation. For each validation point, you are likely showing the model data that is temporally nearby, both before and after the validation point. This is information the model cannot possibly have when running in real time.

To alleviate this, I would keep the data in the correct time order and instead take the validation data as a contiguous chunk of time. To be extra careful, you could throw away data in a small window around the validation set, so that the training data won't contain information about the edges of the validation data.
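The split described above could be sketched like this (a minimal example; the `gap` embargo size and the toy data are assumptions, not from the question):

```python
import numpy as np

def time_split(X, y, test_frac=0.2, gap=10):
    # Keep samples in time order: the last test_frac of the series
    # becomes the validation set. A `gap`-sample embargo between
    # train and validation is discarded so that overlapping windows
    # near the boundary cannot leak information across it.
    n = len(X)
    split = int(n * (1 - test_frac))
    X_train, y_train = X[:split - gap], y[:split - gap]
    X_val, y_val = X[split:], y[split:]
    return X_train, y_train, X_val, y_val

# Toy time-indexed data: 100 samples in chronological order.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2
X_tr, y_tr, X_va, y_va = time_split(X, y)
# Train covers indices 0..69, validation 80..99; 70..79 are discarded.
```

scikit-learn's `TimeSeriesSplit` offers a similar chronological split (with a `gap` parameter) if you prefer a ready-made cross-validation variant.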

This appears to be a good resource, but I have not read it too thoroughly: https://www.kaggle.com/dansbecker/data-leakage
