Negative R2_score and bad predictions for my sales prediction problem using LightGBM

Question

Gopik Anand

April 18, 2022 05:05

My project involves trying to predict the sales quantity for a specific item across a whole year. I've used the LightGBM package for making the predictions. The params I've set for it are as follows:

params = {
    'nthread': 10,
    'max_depth': 5,  # DONE
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',
    'metric': 'mape',  # this is abs(a-e)/max(1,a)
    'num_leaves': 2,  # DONE
    'learning_rate': 0.2180,  # DONE
    'feature_fraction': 0.9,  # DONE
    'bagging_fraction': 0.990,  # DONE
    'bagging_freq': 1,  # DONE
    'lambda_l1': 3.097758978478437,  # DONE
    'lambda_l2': 2.9482537987198496,  # DONE
    'verbose': 1,
    'min_child_weight': 6.996211413900573,
    'min_split_gain': 0.037310344962162616,
    'min_data_in_bin': 1,  # DONE
    'min_data_in_leaf': 2,  # DONE
    'num_boost_round': 1,  # DONE
    'max_bin': 7,  # DONE
    'extra_trees': True,  # DONE
    'early_stopping_rounds': -1
}

My dataset consists of daily sales data (columns: date, quantity) for 2017, 2018, 2019 and the first 3 months of 2020. I use the 2017 and 2018 data for training and cross-validation, and test on the 2019 data. However, my predictions for the year are way off the mark, whether I aggregate the quantities on a weekly, monthly, quarterly or yearly basis (error of about 40-50%, and that is after tuning the params to bring the error down to these values). Moreover, the r2_score of the predictions is negative, around -2.9148426301633803. Any suggestions on what can be done to make it better?

Script for lightgbm:

import lightgbm as lgb
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

# train_x/train_y and test_x/test_y are prepared earlier (not shown)
lgb_train = lgb.Dataset(train_x, train_y)
lgb_valid = lgb.Dataset(test_x, test_y)
model = lgb.train(params, lgb_train,
                  valid_sets=[lgb_train, lgb_valid],
                  verbose_eval=50)
# Keep only the 2019 rows; .copy() avoids modifying a view of df
test_df_pred = df[(df.date >= '2019-01-01') & (df.date < '2020-01-01')].copy()
#test_df_pred = df[(df.date >= '2019-01-01') & (df.date < '2019-02-01')]
#test_df_pred = df[(df.date >= '2019-01-15') & (df.date < '2019-01-22')]
# Calendar features derived from the date
test_df_pred['month'] = test_df_pred['date'].dt.month
test_df_pred['day'] = test_df_pred['date'].dt.dayofweek
test_df_pred['year'] = test_df_pred['date'].dt.year
col = [i for i in test_df_pred.columns if i not in ['date', 'id', 'qty']]
y_test_pred = model.predict(test_df_pred[col])
test_df_pred['qty_pred'] = y_test_pred
# Error metrics on the daily predictions
mse = mean_squared_error(y_true=test_df_pred['qty'], y_pred=test_df_pred['qty_pred'])
mae = mean_absolute_error(y_true=test_df_pred['qty'], y_pred=test_df_pred['qty_pred'])
mape = mean_absolute_percentage_error(y_true=test_df_pred['qty'], y_pred=test_df_pred['qty_pred'])
# Total actual vs. predicted quantity over the test period
qty = test_df_pred.qty.sum()
qty_pred = test_df_pred.qty_pred.sum()
diff = qty_pred - qty
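
For context on the negative r2_score mentioned above: R2 is commonly defined as 1 - SS_res / SS_tot, so it drops below zero whenever the predictions have a larger squared error than simply predicting the mean of the actual values. The following is a minimal sketch of that definition in plain Python. The helper name r2_manual and the numbers are made up for illustration and are not taken from the dataset in the question.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (hypothetical daily quantities)
y_true = [10, 12, 14, 16]
mean_pred = [13, 13, 13, 13]   # always predict the mean of y_true
bad_pred = [20, 20, 20, 20]    # constant prediction far from the data

r2_mean = r2_manual(y_true, mean_pred)  # exactly 0: no better than the mean
r2_bad = r2_manual(y_true, bad_pred)    # negative: worse than the mean
print(r2_mean, r2_bad)
```

Read this way, a value near -2.9 would suggest the squared error of the predictions is roughly four times that of a constant mean prediction, assuming the score was computed on the same daily values.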

Topic lightgbm xgboost time-series python predictive-modeling

Category Data Science

Shahriyar Mammadli · Accepted Answer · November 3, 2020 19:02

I assume you are new to the field, so I would suggest working through some tutorials to reach your goal, because the approach you took is incorrect. I guess you want to model the sales as a time series without any external predictors, that is, to forecast future values by looking at past values. To achieve that, you need to use algorithms like ARIMA, exponential smoothing, etc. What you have done here instead is try to correlate the year, month, and day with the sales. As expected, these features carry no information about the sales (and you also encoded them incorrectly). That is why your performance metric shows a negative result. As a reference, check these, which are similar to your problem: Source1, Source2, Source3. These will solve your issue.
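
As a rough illustration of the exponential smoothing idea mentioned above, here is a minimal sketch of simple exponential smoothing written in plain Python. It is not the code from the linked sources. The function names, the smoothing factor alpha=0.3, and the sample quantities are all assumptions chosen for the example.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after processing every observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, with level_0 = y_0.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def forecast(series, alpha, horizon):
    """Simple exponential smoothing yields a flat forecast at the last level."""
    level = simple_exponential_smoothing(series, alpha)
    return [level] * horizon


# Hypothetical daily quantities, oldest first (not from the question's data)
daily_qty = [20, 22, 19, 25, 24, 23, 26]
next_week = forecast(daily_qty, alpha=0.3, horizon=7)
print(next_week)
```

This simplest variant ignores trend and seasonality, so on daily sales with a weekly pattern it mainly serves as a baseline. Seasonal methods such as Holt-Winters or seasonal ARIMA, available in libraries like statsmodels, are likely a better fit for the data described in the question.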
