How to improve a regression neural network?
I am new to deep learning and data science and am trying to build my knowledge by working on hackathons. The hackathon project I am currently working on is to predict the closing price of a cryptocurrency from 48 parameters, with ~1200 records.
So far I have achieved reasonable accuracy with the model, but my score is still very low. I have tried everything I know, but none of it seems to affect the performance at all. So I am looking for suggestions and tips, since I believe there is room for improvement.
Dataset
Here are some sample records from my dataset.
id | asset_id | open | high | low | volume | market_cap | url_shares | unique_url_shares | reddit_posts | reddit_posts_score | reddit_comments | reddit_comments_score | tweets | tweet_spam | tweet_followers | tweet_quotes | tweet_retweets | tweet_replies | tweet_favorites | tweet_sentiment1 | tweet_sentiment2 | tweet_sentiment3 | tweet_sentiment4 | tweet_sentiment5 | tweet_sentiment_impact1 | tweet_sentiment_impact2 | tweet_sentiment_impact3 | tweet_sentiment_impact4 | tweet_sentiment_impact5 | social_score | average_sentiment | news | price_score | social_impact_score | correlation_rank | galaxy_score | volatility | market_cap_rank | percent_change_24h_rank | volume_24h_rank | social_volume_24h_rank | social_score_24h_rank | medium | youtube | social_volume | percent_change_24h | market_cap_global | close |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID_322qz6 | 1 | 9422.849081 | 9428.490628 | 9422.849081 | 713198620.0 | 173763453624.0 | 1689.0 | 817.0 | 55.0 | 105.0 | 61.0 | 271.0 | 3420.0 | 1671.0 | 11675867.0 | 39.0 | 1343.0 | 448.0 | 2237.0 | 124.0 | 330.0 | 331.0 | 2515.0 | 120.0 | 506133.0 | 1326610.0 | 1159677.0 | 8406185.0 | 281329.0 | 11681999.0 | 3.6 | 69.0 | 2.7 | 3.6 | 3.3 | 66.0 | 0.0071176 | 1.0 | 606.0 | 2.0 | 1.0 | 1.0 | 2.0 | 5.0 | 4422 | 1.4345161346109587 | 281806567507.0 | 9428.279323 |
ID_3239o9 | 1 | 7985.359278 | 7992.059917 | 7967.567267 | 400475518.0 | 142694202230.96 | 920.0 | 544.0 | 20.0 | 531.0 | 103.0 | 533.0 | 1491.0 | 242.0 | 5917814.0 | 195.0 | 1070.0 | 671.0 | 3888.0 | 1.0 | 52.0 | 315.0 | 1100.0 | 23.0 | 1320.0 | 381117.0 | 1706376.0 | 3754815.0 | 80010.0 | 5924770.0 | 3.7 | 1.0 | 2.0 | 2.0 | 1.0 | 43.5 | 0.00941863 | 1.0 | 2159 | -2.4595073021531104 | 212689713284.66 | 7967.567267 | ||||||
ID_323J9k | 1 | 49202.033778 | 49394.593518 | 49068.057046 | 3017728869.0 | 916697653223.0 | 1446.0 | 975.0 | 72.0 | 1152.0 | 187.0 | 905.0 | 9346.0 | 4013.0 | 47778746.0 | 104.0 | 2014.0 | 1099.0 | 11476.0 | 331.0 | 923.0 | 864.0 | 6786.0 | 442.0 | 9848462.0 | 5178557.0 | 2145663.0 | 25510267.0 | 5110490.0 | 47796942.0 | 3.7 | 22.0 | 3.1 | 3.0 | 3.3 | 65.5 | 0.01353005 | 1.0 | 692.0 | 3.0 | 1.0 | 1.0 | 10602 | 4.942447794031182 | 1530711784042.0 | 49120.738484 |
The dataset has 48 features; however, the model performs well with only 5 of them: ['open', 'high', 'low', 'market_cap', 'market_cap_global'].
Model
I have tried a small neural network with only 2 hidden layers, fed with the 5 features above after scaling them with a standard scaler. Apart from this, I have also used callbacks, early stopping, and a custom loss function for calculating RMSE.
So far, this is the best-performing model I have been able to create:
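# imports
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
# create model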
model_dl2 = Sequential()
model_dl2.add(Dense(50, input_dim=5, activation='relu'))
model_dl2.add(Dense(75, activation='relu'))
model_dl2.add(Dense(1, activation='linear'))
# custom loss function
from keras import backend as k
def root_mean_squared_error(y_true, y_pred):
    return k.sqrt(k.mean(k.square(y_pred - y_true)))
# callbacks
loss = ModelCheckpoint('Models/best_model2.h5', monitor='val_loss', verbose=1, save_best_only=True)
es = EarlyStopping(patience=500)
# Compile model
opt = tf.keras.optimizers.Adam(learning_rate=0.5, amsgrad=True)
model_dl2.compile(loss= root_mean_squared_error, optimizer=opt)
model_dl2.fit(x_trainS2, y_trainS2, validation_data=(x_testS2, y_testS2), epochs=3000, batch_size=128, callbacks=[loss, es])
# resulting RMSE: 53
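For reference, the two pieces the model above depends on, standard scaling and the RMSE metric, can be sketched in plain NumPy. This is only an illustrative sketch: the function names `standardize` and `rmse` are hypothetical, and it assumes the scaler statistics are computed on the training split only and then reused for the test split (using the population standard deviation, as scikit-learn's `StandardScaler` does).

```python
import numpy as np

def standardize(train, test):
    """Scale both splits using the mean and std of the training split only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against division by zero for constant columns.
    std = np.where(std == 0, 1.0, std)
    return (train - mean) / std, (test - mean) / std

def rmse(y_true, y_pred):
    """Root mean squared error, the same quantity the custom Keras loss computes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Made-up stand-ins for the 5 selected feature columns.
    x_train = rng.normal(100.0, 20.0, size=(8, 5))
    x_test = rng.normal(100.0, 20.0, size=(4, 5))
    x_train_s, x_test_s = standardize(x_train, x_test)
    print(x_train_s.mean(axis=0).round(6))
    print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Computing RMSE outside of Keras like this makes it easy to check that the number reported during training matches the number on the unscaled target.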
My attempt to increase the performance
The model is stuck at an RMSE of around 53. I have tried many things, such as:
- different activation functions, and different optimizers with different learning rates
- increasing/decreasing the number of hidden layers (vertical scaling)
- increasing/decreasing the number of neurons per layer (horizontal scaling)
- applying PCA to the remaining 43 columns, or to a selected subset of them
But none of these improved the RMSE.
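The PCA step mentioned in the list above can be sketched roughly as follows. This is a minimal NumPy version, not the exact code used for the attempts described here: the function name `pca_reduce` and the 95% variance threshold are assumptions made for illustration, and the columns are standardized first because the features live on very different scales.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Project X onto the fewest principal components that together
    explain at least `var_threshold` of the total variance."""
    X = np.asarray(X, dtype=float)
    # Standardize columns so large-valued features do not dominate.
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    Xs = (X - X.mean(axis=0)) / std
    # The right singular vectors of the centered data are the principal axes.
    _, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    # Index of the first cumulative ratio reaching the threshold, plus one.
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    k = min(k, len(S))
    return Xs @ Vt[:k].T, explained[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(50, 2))
    # The third column is a linear combination of the first two.
    X = np.column_stack([base, base[:, 0] + base[:, 1]])
    Z, ratios = pca_reduce(X, var_threshold=0.95)
    print(Z.shape, ratios)
```

The reduced matrix `Z` would then be concatenated with, or used instead of, the 5 hand-picked features before being fed to the network.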
Apart from this, the dataset also has a few issues, such as:
- many null values (about 30%) in both the features and the target 'close'
- multicollinearity
- skewness (right-skewed).
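The three issues listed above can each be quantified with a few lines of NumPy. The sketch below is only illustrative: the helper names are hypothetical, nulls are assumed to be encoded as NaN in a float array, and correlations are computed on complete rows only.

```python
import numpy as np

def null_fraction(X):
    """Fraction of missing (NaN) entries in each column."""
    return np.isnan(X).mean(axis=0)

def skewness(X):
    """Sample skewness per column, ignoring NaNs; positive means right-skewed."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    return np.nanmean(((X - mean) / std) ** 3, axis=0)

def correlation_matrix(X):
    """Pearson correlation between columns, using rows without any NaN."""
    complete = X[~np.isnan(X).any(axis=1)]
    return np.corrcoef(complete, rowvar=False)

if __name__ == "__main__":
    # Tiny made-up example: the second column is exactly twice the first.
    X = np.array([
        [1.0, 2.0],
        [2.0, 4.0],
        [3.0, 6.0],
        [np.nan, 8.0],
        [10.0, 20.0],
    ])
    print(null_fraction(X))
    print(skewness(X))
    print(correlation_matrix(X))
```

Off-diagonal correlations close to 1 or -1 point to multicollinearity, and a clearly positive skewness value corresponds to the right-skew described above.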
To address these issues I tried a few things, none of which helped much except the first:
- For null values, filling them with 0 in both the features and the target column seems to work well, so I did not drop any rows.
- For skewness, I tried a power transformation, but it didn't help. I also can't apply a log transformation because the dataset contains negative values, so in the end I left the skew untreated.
- Because of multicollinearity, I used only the 5 features mentioned above, which work well. However, these 5 features are also highly correlated with each other; I was relying on data transformation to deal with that, but it didn't work.
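On the point about negative values: a plain log is undefined for them, but a sign-preserving variant is not. The sketch below shows that variant purely as an illustration; it is not one of the attempts reported in this post, and the function names are hypothetical.

```python
import numpy as np

def signed_log1p(x):
    """Sign-preserving log transform: sign(x) * log(1 + |x|).
    Defined for negative, zero, and positive values."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def signed_expm1(z):
    """Inverse of signed_log1p, to map values back to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.expm1(np.abs(z))

if __name__ == "__main__":
    # Made-up values mixing a negative percentage change with large prices.
    values = np.array([-2.46, 0.0, 1.43, 9428.28, 49120.74])
    transformed = signed_log1p(values)
    print(transformed)
    print(signed_expm1(transformed))
```

The transform is monotonic and compresses large magnitudes on both sides of zero, which is what makes it a candidate for right-skewed columns that also contain negative entries.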
My question
My problems may sound very basic, but I have applied everything I have learned on my own and am now out of ideas. Fixing the dataset issues could be one solution, but I don't know what else to try beyond the things above. If the issue is in the model instead, it would be great if you could recommend any tuning that I may be missing.
Feel free to ask for more details if you need them.