How to improve regression neural network?

I am new to deep learning and data science and am trying to build my knowledge by working on hackathons. The hackathon project I am currently working on is to predict the closing price of a cryptocurrency from 48 features, with ~1200 records.

So far I have been able to get reasonable results from the model, but my score is still very low. I have tried many things I know of, but none of them seems to affect the performance at all. So I would like some suggestions and tips, since there should be scope to improve the performance.

Dataset

Here are some sample records from my dataset.

id asset_id open high low volume market_cap url_shares unique_url_shares reddit_posts reddit_posts_score reddit_comments reddit_comments_score tweets tweet_spam tweet_followers tweet_quotes tweet_retweets tweet_replies tweet_favorites tweet_sentiment1 tweet_sentiment2 tweet_sentiment3 tweet_sentiment4 tweet_sentiment5 tweet_sentiment_impact1 tweet_sentiment_impact2 tweet_sentiment_impact3 tweet_sentiment_impact4 tweet_sentiment_impact5 social_score average_sentiment news price_score social_impact_score correlation_rank galaxy_score volatility market_cap_rank percent_change_24h_rank volume_24h_rank social_volume_24h_rank social_score_24h_rank medium youtube social_volume percent_change_24h market_cap_global close
ID_322qz6 1 9422.849081 9428.490628 9422.849081 713198620.0 173763453624.0 1689.0 817.0 55.0 105.0 61.0 271.0 3420.0 1671.0 11675867.0 39.0 1343.0 448.0 2237.0 124.0 330.0 331.0 2515.0 120.0 506133.0 1326610.0 1159677.0 8406185.0 281329.0 11681999.0 3.6 69.0 2.7 3.6 3.3 66.0 0.0071176 1.0 606.0 2.0 1.0 1.0 2.0 5.0 4422 1.4345161346109587 281806567507.0 9428.279323
ID_3239o9 1 7985.359278 7992.059917 7967.567267 400475518.0 142694202230.96 920.0 544.0 20.0 531.0 103.0 533.0 1491.0 242.0 5917814.0 195.0 1070.0 671.0 3888.0 1.0 52.0 315.0 1100.0 23.0 1320.0 381117.0 1706376.0 3754815.0 80010.0 5924770.0 3.7 1.0 2.0 2.0 1.0 43.5 0.00941863 1.0 2159 -2.4595073021531104 212689713284.66 7967.567267
ID_323J9k 1 49202.033778 49394.593518 49068.057046 3017728869.0 916697653223.0 1446.0 975.0 72.0 1152.0 187.0 905.0 9346.0 4013.0 47778746.0 104.0 2014.0 1099.0 11476.0 331.0 923.0 864.0 6786.0 442.0 9848462.0 5178557.0 2145663.0 25510267.0 5110490.0 47796942.0 3.7 22.0 3.1 3.0 3.3 65.5 0.01353005 1.0 692.0 3.0 1.0 1.0 10602 4.942447794031182 1530711784042.0 49120.738484

The dataset has 48 features; however, the model performs well with only 5 columns: ['open', 'high', 'low', 'market_cap', 'market_cap_global']

Model

I have tried a small neural network with only 2 hidden layers, fed with the above 5 features scaled with a standard scaler. Apart from this, I have also used callbacks (model checkpointing and early stopping) and a custom loss function for calculating RMSE. So far this is the best-performing model I have been able to create:

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# create model: 5 scaled inputs -> 50 -> 75 -> 1 linear output
model_dl2 = Sequential()
model_dl2.add(Dense(50, input_dim=5, activation='relu'))
model_dl2.add(Dense(75, activation='relu'))
model_dl2.add(Dense(1, activation='linear'))

# custom loss function (RMSE)
def root_mean_squared_error(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

# callbacks: keep the weights with the best val_loss, stop after 500 epochs without improvement
checkpoint = ModelCheckpoint('Models/best_model2.h5', monitor='val_loss', verbose=1, save_best_only=True)
es = EarlyStopping(patience=500)

# Compile model
opt = tf.keras.optimizers.Adam(learning_rate=0.5, amsgrad=True)
model_dl2.compile(loss=root_mean_squared_error, optimizer=opt)

model_dl2.fit(x_trainS2, y_trainS2, validation_data=(x_testS2, y_testS2), epochs=3000, batch_size=128, callbacks=[checkpoint, es])

## accuracy rmse:53

My attempt to increase the performance

The model is stuck at an RMSE of around 53. I have tried many things, such as:

  • different activation functions, and different optimizers with different learning rates
  • increased/decreased the number of hidden layers (vertical scaling)
  • increased/decreased the number of neurons per layer (horizontal scaling)
  • PCA on the remaining 43 columns, or on a selected subset of them

But none of this increased the accuracy.

Apart from this, the dataset also has a few issues, such as:

  1. many null values in both the features and the target 'close', about 30%
  2. multicollinearity
  3. skewness (right-skewed).

To solve these issues I have tried a few things, which weren't that helpful except for the first one.

  1. For null values, filling them with 0 in both the features and the target column seems to work well, so I have not dropped any rows.
  2. For skewness, I tried a power transformation but it didn't work. Also, I can't do a log transformation because the dataset contains negative values. So basically I did nothing.
  3. Because of multicollinearity, I used only the 5 features mentioned above, which work well. However, these 5 features are also highly correlated; I was relying on data transformation to deal with that, but it didn't work.

My question

My problems may sound very basic, but I have applied everything I have learned by myself and am now out of ideas. Fixing the dataset issues could be one solution, but I don't know what else to do after trying the things above. Also, if the issue is in the model, it would be great if you could recommend some tuning that I may be missing.

Feel free to ask for more details if you need them.



These are great first attempts! However, neural networks often struggle with tabular data, especially with only ~1200 records. You might be better served by a traditional ML model (e.g., linear regression, SVM).
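As a rough illustration of that suggestion, here is a minimal scikit-learn sketch that cross-validates a linear regression and an SVM regressor by RMSE. The data is synthetic and only mimics the five columns from the question (an assumption, since the real dataset is not available here), and the helper name `cv_rmse` and the SVR settings are made up for the example, so the numbers it prints say nothing about the hackathon score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the 5 columns in the question (assumption: not the real data).
rng = np.random.default_rng(0)
n = 1200
open_ = rng.lognormal(mean=8.0, sigma=1.0, size=n)
high = open_ * (1 + rng.uniform(0, 0.01, n))
low = open_ * (1 - rng.uniform(0, 0.01, n))
market_cap = open_ * 1.8e7 * (1 + rng.normal(0, 0.01, n))
market_cap_global = market_cap * 1.6 * (1 + rng.normal(0, 0.01, n))
X = np.column_stack([open_, high, low, market_cap, market_cap_global])
y = low + (high - low) * rng.uniform(0, 1, n)  # close lies between low and high


def cv_rmse(model, X, y):
    """Mean 5-fold cross-validated RMSE (hypothetical helper for this sketch)."""
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


# Scaling lives inside the pipeline so each fold is scaled on its own training split.
linear = make_pipeline(StandardScaler(), LinearRegression())
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))

rmse_linear = cv_rmse(linear, X, y)
rmse_svr = cv_rmse(svr, X, y)
print(f"linear regression CV RMSE: {rmse_linear:.2f}")
print(f"SVR CV RMSE:               {rmse_svr:.2f}")
```

Cross-validation is used here instead of a single train/test split because, with only ~1200 rows, a single split can give a noisy estimate of the error.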

Regardless of whether you're using a neural net or something else, you should normalize/transform your input features and the output (i.e., your closing price). Transforming your inputs would help with the right-skew problem that you're facing and shrink the overall scale of your regression data, which helps your prediction models converge towards a minimum loss. I hope that this helps!
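One possible way to do this, sketched with scikit-learn: a signed log transform (`sign(x) * log1p(|x|)`) for the inputs, which stays defined for negative values, and `TransformedTargetRegressor` to fit on the log of the closing price while predictions come back on the original scale. The data below is synthetic, the column choice and the helper `signed_log1p` are assumptions for illustration, and the target is assumed to be non-negative so that `log1p` is valid.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def signed_log1p(x):
    """Log-like transform that also accepts negative values (hypothetical helper)."""
    return np.sign(x) * np.log1p(np.abs(x))


# Synthetic, right-skewed stand-in data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 1200
open_ = rng.lognormal(mean=8.0, sigma=1.0, size=n)
high = open_ * (1 + rng.uniform(0, 0.01, n))
low = open_ * (1 - rng.uniform(0, 0.01, n))
percent_change_24h = rng.normal(0, 3, n)  # contains negative values
X = np.column_stack([open_, high, low, percent_change_24h])
y = low + (high - low) * rng.uniform(0, 1, n)  # close lies between low and high

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Inputs: signed log to reduce skew, then standardize.
features = make_pipeline(
    FunctionTransformer(signed_log1p),
    StandardScaler(),
    LinearRegression(),
)

# Target: fit on log1p(close); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=features, func=np.log1p, inverse_func=np.expm1
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE on the original price scale
print(f"test RMSE: {rmse:.2f}")
```

Because the inverse transform is applied automatically, the RMSE is still reported in price units and stays comparable with the score from the question. The same wrapper can hold a different regressor if a linear model turns out to be too simple.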
