How to reduce RMS error value in regression analysis & predictions - feature engineering, model selection

I have a dataset containing the metadata of Twitch's top 1,000 streamers of 2020. I am taking part in a challenge to predict the values of Followers gained by training a model on the remaining features of the dataset. The objective of the kernel is to get the lowest RMSE (root mean squared error) from the model's predictions. So far, I have made numerous attempts to lower the RMSE loss value as much …
Category: Data Science
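
For reference, the metric this challenge scores on can be written in a few lines. This is a minimal sketch in plain Python; the function name `rmse` and the sample numbers are illustrative and are not taken from the Twitch dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, purely to show the call.
actual = [100.0, 250.0, 400.0]
predicted = [110.0, 240.0, 430.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single 30-unit miss contributes more to the result than the two 10-unit misses combined.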

Which error metric is good for measuring accuracy

I am estimating water depth from satellite data (predicted value) and would like to validate my results against bathymetry lidar data collected in the field, which is believed to be more accurate (observed value). The number of observations differs between depth ranges. For example, there are 300 observations in the 0-10 m depth range, whereas the deeper range (10-20 m) has fewer (~50 points). I have been using RMSE (as I would like to …
Topic: rmse metric
Category: Data Science
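
Because the shallow range holds far more points than the deeper one, a single pooled RMSE is dominated by the shallow observations. One option, sketched here, is to report RMSE per depth range alongside the pooled value. The helper `rmse_by_depth_range`, the bin edges and the toy numbers are all hypothetical.

```python
import math


def rmse(pairs):
    """RMSE over a list of (observed, predicted) pairs."""
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))


def rmse_by_depth_range(observed, predicted, edges):
    """Group pairs by observed depth and return {(low, high): (n, rmse)}.

    A bin covers low <= observed < high; empty bins are left out.
    """
    result = {}
    for low, high in zip(edges[:-1], edges[1:]):
        pairs = [(o, p) for o, p in zip(observed, predicted) if low <= o < high]
        if pairs:
            result[(low, high)] = (len(pairs), rmse(pairs))
    return result


# Toy data: four shallow points with small errors, two deep points with larger ones.
observed = [2.0, 4.0, 6.0, 8.0, 12.0, 18.0]
predicted = [2.5, 3.5, 6.5, 7.5, 14.0, 15.0]

pooled = rmse(list(zip(observed, predicted)))
per_range = rmse_by_depth_range(observed, predicted, edges=[0, 10, 20])
print(pooled, per_range)
```

Reporting the count next to each per-range RMSE makes it visible how much evidence stands behind each number.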

Measuring performance of customer purchase predictions

My goal is to develop a model that predicts the next customer purchase in USD. (Update: if a customer made no purchase during the time period of the dataset, the next purchase label is set to zero.) I am trying to determine what would be the most effective metric for measuring the model's performance. The results look like this:

y_true_usd  y_predicted_usd
1.2         0.8
0           0.3
0           1.1
0           0
0           0.1
5.3         4.3

First I thought about going with RMSE, …
Category: Data Science
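
To make the comparison concrete, the six rows shown in the excerpt can be scored with both RMSE and MAE. This sketch only illustrates how the two metrics weigh the same errors; it does not settle which one suits the problem.

```python
import math

# The six (y_true_usd, y_predicted_usd) rows quoted in the question.
y_true = [1.2, 0.0, 0.0, 0.0, 0.0, 5.3]
y_pred = [0.8, 0.3, 1.1, 0.0, 0.1, 4.3]

errors = [p - t for t, p in zip(y_true, y_pred)]

# RMSE squares each error first, so the 1.1 and 1.0 misses dominate.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE weighs every dollar of error equally.
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE = {rmse:.3f} USD, MAE = {mae:.3f} USD")
```

RMSE is never smaller than MAE on the same errors, and the gap widens as a few large misses account for more of the total error.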

Feature engineering: The more features I add the better RMSE I get?

I have a model with 7 features, and I'm trying to figure out whether I can improve its performance by adding more features. I'm relying on RMSE to measure the accuracy of my predictions. Going from 7 features to 25, the RMSE gets slightly better (smaller) each time I add a new feature. I find it hard to believe that all of these features improved the performance of my model as …
Category: Data Science

High loss but low rmse, how?

I have trained an LSTM model on a dataset, but its loss during training is ten times the RMSE during testing. How is this possible, and can I use this model if the RMSE is very low but the loss is high? How can I improve the training and test loss?
Category: Data Science
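
One possible explanation, assuming the training loss is mean squared error (the question does not say which loss was used): MSE is in squared units and RMSE is its square root, so the two numbers are not directly comparable. The values below are made up so that the loss comes out at exactly ten times the RMSE.

```python
import math

# Hypothetical targets and predictions where every prediction is off by 10 units.
y_true = [50.0, 120.0, 200.0, 310.0]
y_pred = [60.0, 110.0, 210.0, 300.0]

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)

# Same predictions, same errors: the MSE "loss" is 100 while the RMSE is 10.
print(mse, rmse, mse / rmse)
```

Comparing like with like, that is the same metric on the same target scale for both training and test data, is the safer check.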

How many features do I select when doing feature selection for regression algorithms? Are R2 and RMSE good measures of success for overfitting?

Context: I'm currently crafting and comparing machine learning models to predict housing data. I have around 32000 data points, 42 features, and I'm predicting housing price. I'm comparing Random Forest Regressor, Decision Tree Regressor, and Linear Regression. I can tell there is some overfitting going on, as my initial values vs cross validated values are as follows: RF: 10 Fold R Squared = 0.758, neg RMSE = -540.2 vs unvalidated R Squared of 0.877, RMSE of 505.6 DT: 10 Fold …
Category: Data Science

What does the RMSE of an LSTM model tell?

Suppose I made a model which has an RMSE of 50. Now I predict the next data point, which comes out as 500. Does that mean the actual value has a high probability of being within the range 450-550? If so, what is the probability that it will be in this range? Or does it mean the actual value has a high probability of being within the range 475-525? If so, what is the probability that it will …
Category: Data Science
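
RMSE on its own does not give a probability. If, and only if, one assumes the prediction errors are roughly normal with zero mean and a standard deviation equal to the RMSE, the two ranges in the question can be given an approximate coverage. The sketch below relies on that assumption, which should be checked against the model's actual residuals.

```python
from statistics import NormalDist

rmse = 50.0         # value from the question
prediction = 500.0  # value from the question

# ASSUMPTION: errors ~ Normal(0, rmse). This is not implied by RMSE itself.
error_dist = NormalDist(mu=0.0, sigma=rmse)


def coverage(half_width):
    """Probability that the actual value lies within prediction +/- half_width."""
    return error_dist.cdf(half_width) - error_dist.cdf(-half_width)


for half_width in (50.0, 25.0):
    low, high = prediction - half_width, prediction + half_width
    print(f"{low:.0f} to {high:.0f}: {coverage(half_width):.1%}")
```

Under that assumption the 450-550 range covers about 68% of cases and the 475-525 range about 38%. With skewed or heavy-tailed errors these figures can be far off.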

Appropriate loss function and metrics for regression task with mixed outputs

I'm trying to train an EfficientNet-based Keras model that takes an image as input and returns two numeric values as output. Here's the model:

    def prepare_model_eff(input_shape):
        inputs = Input(shape=input_shape)
        x = EfficientNetB3(include_top=False, input_shape=input_shape)(inputs)
        x.trainable = True
        x = layers.GlobalAveragePooling2D()(x)
        x = layers.Dropout(rate=0.1, )(x)
        x = layers.BatchNormalization()(x)
        out_1 = layers.Dense(1, activation='linear', name='out_1')(x)
        out_2 = layers.Dense(1, activation='linear', name='out_2')(x)
        model = Model(inputs=inputs, outputs=[out_1, out_2])

As far as I know, the most common metric for such tasks is Root Mean Square Error (RMSE): def …
Category: Data Science

Determining which model result is better

I am trying to determine which model result is better. Both runs are trying to achieve the same objective; the only difference is the exact data being used. I used random forest, XGBoost, and elastic net for regression. Here is one of the results, which has a low RMSE but a not-so-good R2:

model  n_rows_test  n_rows_train  r2                 rmse
rf     128144       384429        0.258415240861579  8.44255341472637
xgb    128144       384429        0.103772500839367  9.28116624462333
e-net  128144       384429        0.062460300392487  9.49266713837073

The other model run has …
Category: Data Science
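
A low RMSE next to a low R2 usually means the target itself varies little on that data. Assuming R2 was computed on the same test rows as 1 - SSE/SST, which the excerpt does not state, RMSE and R2 together imply the standard deviation of the target: std = RMSE / sqrt(1 - R2). The sketch applies that to the three result rows quoted in the question.

```python
import math

# (r2, rmse) pairs copied from the first set of results in the question.
results = {
    "rf": (0.258415240861579, 8.44255341472637),
    "xgb": (0.103772500839367, 9.28116624462333),
    "e-net": (0.062460300392487, 9.49266713837073),
}


def implied_target_std(r2, rmse):
    """Standard deviation of the target implied by R2 = 1 - MSE / variance."""
    return rmse / math.sqrt(1.0 - r2)


implied = {name: implied_target_std(r2, rmse) for name, (r2, rmse) in results.items()}
for name, std in implied.items():
    print(f"{name}: implied target std = {std:.3f}")
```

All three rows point to nearly the same spread (about 9.8), as expected if they were scored on the same test set. When two runs use different data, an RMSE is best judged relative to that run's own target spread, which is what R2 already does.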

Perform bootstrapping of an ordinary linear regression model, using B=100 bootstrap resamples of my dataset, and getting RMSE

So I'm studying machine learning through R, and I'm working with the Boston data set from the MASS library. I am practicing bootstrapping. I have already carried out an analysis to determine how many distinct data points, on average, are drawn from the sample to make up a bootstrap resample, using B=100 resamples of the dataset. Next I would like to do two things: perform bootstrapping of an ordinary linear regression model, again using B=100 resamples of the data set, and use …
Category: Data Science
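
The procedure described can be sketched as follows. The question works in R with the Boston data; this sketch is in Python, uses synthetic data and a single predictor, and scores each fitted model on the rows left out of its resample (out-of-bag), which is one common choice and not necessarily what the exercise intends. All function and variable names are illustrative.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_rmse(xs, ys, n_resamples=100, seed=0):
    """Fit on each bootstrap resample and return the out-of-bag RMSE of each fit."""
    rng = random.Random(seed)
    n = len(xs)
    scores = []
    for _ in range(n_resamples):
        # Draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(idx)
        out_of_bag = [i for i in range(n) if i not in drawn]
        if not out_of_bag or len(drawn) < 2:
            continue  # nothing to score on, or too few distinct rows to fit a line
        intercept, slope = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in out_of_bag]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores


# Synthetic data (not the Boston set): y = 2 + 3x plus Gaussian noise with sd 1.
data_rng = random.Random(42)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [2.0 + 3.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]

scores = bootstrap_rmse(xs, ys, n_resamples=100, seed=0)
print(f"mean out-of-bag RMSE over {len(scores)} resamples: {sum(scores) / len(scores):.3f}")
```

The spread of the per-resample scores gives a rough idea of how much the RMSE estimate itself varies.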

Different results on every run of a neural network?

I have written a simple neural network (MLP Regressor) to fit simple data frame columns. To find an optimal architecture, I also defined it as a function to see whether it converges to a pattern. But every time I run the model, it gives me a different result from the last time I tried, and I do not know why. Because it is fairly difficult to make the question reproducible, I can not …
Category: Data Science
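
One common cause, which the excerpt does not confirm applies here, is that a neural network starts from randomly drawn weights, so each run begins from a different point unless the random seed is fixed. The toy function below is hypothetical and only mimics weight initialisation; in scikit-learn's `MLPRegressor` the corresponding setting is the `random_state` parameter.

```python
import random


def init_weights(n_weights, seed=None):
    """Draw starting weights; seed=None means a different draw on every call."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_weights)]


# Unseeded: two "runs" start from different weights.
run_a = init_weights(5)
run_b = init_weights(5)

# Seeded: two "runs" start from identical weights, so training can repeat exactly.
run_c = init_weights(5, seed=42)
run_d = init_weights(5, seed=42)

print(run_a == run_b, run_c == run_d)
```

Fixing the seed makes runs repeatable but does not make one particular seed's result more trustworthy. Averaging over several seeds gives a fairer picture of an architecture.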

Comparing RMSEs of multiple test sets having different sizes

The data I have is time series data (stock returns), and I am training a Random Forest Regressor on it. Total observations = 2499. To better evaluate the performance, I have implemented rolling-window testing with training window sizes = 500, 700, 900, ..., 2100. Though instinctively it would seem obvious to choose the window size that produced the lowest RMSE, how can I be sure that the comparison is fair? I mean, with increasing window size, the test set size …
Category: Data Science

Low MAE, RMSE, RMSLE and MAPE, but also a low R^2

I have a dataframe containing the IDs of 2000 questions, a list of scores representing difficulty, and the following features: how often the question was answered, how often the answer was changed because the students were undecided, a normalized "frequency of changing the answers" (so the last two features divided), and the average time spent on a question. The most important seems to be this normalized frequency (50%), then the average time (22%), how often the question was answered …
Category: Data Science

Why is linear regression not doing worse with a low weighted attribute?

I've been able to build a few linear regression models that can predict a material strength quite well: minimum RMSE of 17.95 using 11 attributes that I have selected from 159 original attributes. The data is distributed with mean=234.4 and stdev=19.9. I am working in Orange3. When using only the highest weighted attribute (weight 8.013) the model calculates RMSE of 18.767. If I use only the lowest weighted attribute (weight 0.051) the RMSE is 20.007. The difference is 1.24, or …
Category: Data Science
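
A useful reference point here is the model that always predicts the mean, whose RMSE equals the standard deviation of the target (19.9). Assuming the quoted RMSE values and the standard deviation refer to the same data, which the excerpt does not state, each RMSE can be converted to an approximate R2.

```python
target_std = 19.9  # stdev quoted in the question; also the RMSE of a mean-only model

# RMSE values quoted in the question.
rmse_by_model = {
    "11 selected attributes": 17.95,
    "highest weighted attribute only": 18.767,
    "lowest weighted attribute only": 20.007,
}


def approx_r2(rmse, std):
    """R2 implied by R2 = 1 - (RMSE / std) ** 2."""
    return 1.0 - (rmse / std) ** 2


r2_by_model = {name: approx_r2(v, target_std) for name, v in rmse_by_model.items()}
for name, r2 in r2_by_model.items():
    print(f"{name}: RMSE = {rmse_by_model[name]}, approx R2 = {r2:.3f}")
```

On these numbers even the 11-attribute model is only modestly better than predicting the mean, and the lowest weighted attribute alone is roughly on par with that baseline. That would explain why the gap between the two single-attribute models looks so small.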

What is a bad, good and excellent metric score for a time series model?

I have created a couple of models for my master's project and used several metrics for evaluation. I used MSE, MAE, MAPE, and RMSE, not because I had really learned much about them, but because I saw them being used in many other projects. Now I have a problem: I need to interpret the results. I searched for articles or studies that classify metric performance as good, bad or excellent. The only material I have found so far is this …
Category: Data Science

How to add RMSE value on a plot with ggplot

I added the R2 value and the formula of the regression function, but I also want the RMSE value on my plot. Maybe I need to add something, but I could not find a proper answer to this question either here or on Google...

    ggplot(data = AGB.rf$pred) +
      geom_point(mapping = aes(x = pred, y = obs, color = pred, shape=1)) +
      geom_smooth(mapping = aes(x = pred, y = obs), method="lm", se = FALSE) +
      stat_cor(aes(x = pred, y = obs, label = ..rr.label..), label.y = 3000) + …
Topic: rmse ggplot2 r
Category: Data Science

How to interpret the Mean squared error value in a regression model?

I'm working on a simple linear regression model to predict 'Label' based on 'feature'. The two variables seem to be highly correlated (corr=0.99). After splitting the data sample into training and testing sets, I make predictions and evaluate the model.

    metrics.mean_squared_error(Label_test,Label_Predicted) = 99.17777494521019
    metrics.r2_score(Label_test,Label_Predicted) = 0.9909449021176512

Based on the r2_score my model is performing perfectly, 1 being the highest possible value. But when it comes to the mean squared error, I don't know if it shows that my model …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.