Prediction issue with xgboost custom loss

I have an issue with xgboost custom objectives: I cannot get consistent forecasts. In other words, the scale of my forecasts is not in line with the values I would like to predict. I tried many custom losses, but I always get the same issue.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_regression

n_samples_train = 500
n_samples_test = 100
n_features = 200

X, y = make_regression(n_samples_train, n_features, noise=10)
X_test, y_test …
```
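A likely culprit is that custom objectives operate on raw margin scores, which start from base_score (0.5 by default), so regression targets far from that value come out on the wrong scale. A minimal sketch of a squared-error objective with an adjusted base_score, reusing X and y from the snippet above:

```python
import numpy as np
import xgboost as xgb

# A minimal sketch, assuming X, y from the snippet above.
# Custom objectives must return the gradient and hessian of the loss
# with respect to the raw prediction (margin), not the label scale.
def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels          # dL/dpred for 0.5 * (pred - label)^2
    hess = np.ones_like(preds)     # d2L/dpred^2
    return grad, hess

dtrain = xgb.DMatrix(X, label=y)
# base_score defaults to 0.5, far from a typical regression target's mean;
# setting it near the target mean keeps the forecast scale consistent.
params = {'base_score': float(np.mean(y))}
booster = xgb.train(params, dtrain, num_boost_round=100, obj=squared_error_obj)
```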
Category: Data Science

How do you do 1-vs-rest classifiers in XGBoost Library (Not Sklearn)?

I am working with a very large dataset that would benefit from using training continuation with the xgb_model parameter in xgb.train(). The label (Y) of the dataset has 4 classes and is highly imbalanced, so I would like to generate per-label PR curves to evaluate its performance, and would thus need to treat each class as its own binary problem using a one-vs-rest classifier. After a lot of reading I haven't found an equivalent to sklearn's OneVsRestClassifier in …
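The native API has no built-in one-vs-rest wrapper, but it can be rolled by hand: train one binary booster per class on relabeled targets. A rough sketch, assuming X, y (integer labels 0-3) and X_valid exist; each xgb.train call here also accepts xgb_model for training continuation:

```python
import numpy as np
import xgboost as xgb

# A rough sketch, assuming X, y (integer labels 0..3) and X_valid exist.
# One booster per class: relabel that class as 1 and all others as 0.
n_classes = 4
boosters = []
for c in range(n_classes):
    y_bin = (y == c).astype(int)
    dtrain = xgb.DMatrix(X, label=y_bin)
    params = {
        'objective': 'binary:logistic',
        # per-class imbalance correction: negatives / positives
        'scale_pos_weight': float((y_bin == 0).sum()) / max((y_bin == 1).sum(), 1),
    }
    boosters.append(xgb.train(params, dtrain, num_boost_round=100))

# Per-class probabilities for PR curves; column c scores class c vs rest.
dvalid = xgb.DMatrix(X_valid)
scores = np.column_stack([b.predict(dvalid) for b in boosters])
```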
Category: Data Science

Is this XGBoost model tending to overfit?

Here is the grid of hyperparameters that I used:

```python
params = {
    'scale_pos_weight': [1.0],
    'eta': [0.05, 0.1, 0.15, 0.9, 1.0],
    'max_depth': [1, 2, 6, 10, 15, 20],
    'gamma': [0.0, 0.4, 0.5, 0.7]
}
```

The dataset is imbalanced, so I used the scale_pos_weight parameter. After 5-fold cross-validation, the F1 score that I got is 0.530726530426833.
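A direct way to check for overfitting is to watch the gap between train and test scores across boosting rounds. A minimal sketch with xgb.cv, assuming an existing DMatrix named dtrain:

```python
import xgboost as xgb

# A minimal sketch, assuming dtrain is an existing xgb.DMatrix.
# A widening gap between train and test AUC over rounds signals overfitting.
cv_results = xgb.cv(
    params={'objective': 'binary:logistic', 'eta': 0.05,
            'max_depth': 6, 'eval_metric': 'auc'},
    dtrain=dtrain,
    num_boost_round=500,
    nfold=5,
    early_stopping_rounds=20,  # stop when test AUC stops improving
)
print(cv_results[['train-auc-mean', 'test-auc-mean']].tail())
```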
Category: Data Science

What should be the target variable in a CTR maximization problem?

I have a dataset that contains some user-specific details like gender, age-range, region etc., and also behavioural data containing the historical click-through rate (last 3 months) for different ad-types shown to them. A sample of the data is shown below. It has 3 ad-types, i.e. ecommerce, auto, and healthcare, but the actual data contains more ad-types. I need to build a regression model using XGBRegressor that can tell which ad should be shown to a given new user in order to …
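One common framing is to melt the wide per-ad CTR columns into long format, so each row is a (user, ad-type) pair with its historical CTR as the regression target; at serving time, score every ad-type for the new user and show the one with the highest predicted CTR. A sketch with hypothetical column names (df, ctr_ecommerce, etc. are assumptions, not the poster's schema):

```python
import pandas as pd
from xgboost import XGBRegressor

# A sketch with hypothetical column names: the wide per-user CTR columns
# (ctr_ecommerce, ctr_auto, ctr_healthcare) are melted into long format,
# so each row is one (user, ad_type) pair and the target is that pair's CTR.
long_df = df.melt(
    id_vars=['gender', 'age_range', 'region'],
    value_vars=['ctr_ecommerce', 'ctr_auto', 'ctr_healthcare'],
    var_name='ad_type', value_name='ctr',
)
X = pd.get_dummies(long_df.drop(columns='ctr'))
y = long_df['ctr']

model = XGBRegressor().fit(X, y)

# For a new user: build one candidate row per ad_type (aligned to X's
# columns), predict CTR for each, and recommend the argmax.
```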
Category: Data Science

What are the strange markers at the top left of plot_convergence?

I ran gp_minimize with 20 calls: the correct values of the objective are the blue markers at the bottom. What is the strange marker distribution at the top left?

```python
optimize_result = gp_minimize(func_objective, space_xgboost,
                              n_calls=n_calls,
                              n_initial_points=n_initial_points,
                              x0=x0_init, random_state=1)
ax = plot_convergence(optimize_result, true_minimum=optimize_result.fun)
ax.figure.tight_layout()
ax.figure.savefig(f"results/convergence_hpo_skopt.png")
```
Category: Data Science

GridSearch multiplying the number of trees in XGBoost?

I'm having an issue: after running XGBoost in a HalvingGridSearchCV, I receive a certain number of estimators (50, for example), but the number of trees is consistently multiplied by 3. I don't understand why. Here is the code:

```python
model = XGBClassifier(objective='multi:softprob', subsample=0.9,
                      colsample_bytree=0.5, num_class=3)
md = [3, 6, 10, 15]
lr = [0.1, 0.5, 1]
g = [0, 0.25, 1]
rl = [0, 1, 10]
spw = [1, 3, 5]
ns = [5, 10, 20]
```
…
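For what it's worth, multi:softprob grows one tree per class per boosting round, so 50 estimators with num_class=3 yields 150 trees. A quick check, assuming a fitted classifier named model:

```python
# A quick check, assuming `model` is an already-fitted XGBClassifier.
booster = model.get_booster()
trees = booster.trees_to_dataframe()
# Distinct tree count = n_estimators * num_class for multi:softprob.
print(trees['Tree'].nunique())
```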
Category: Data Science

How to make XGBoost capture trend in time series forecasting?

I am trying to forecast monthly sales data. I have been trying some classical models as well as ML models like XGBoost. My data, with a feature set, spans 110 months, and I am trying to forecast the next 12 months. With XGBoost, I've been spending my time mostly on hyperparameter optimization with GridSearch and also state-of-the-art packages like optuna. My current best set of parameters looks like this: parameters …
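Tree models cannot extrapolate beyond the target range seen in training, which is usually why XGBoost misses the trend no matter how the hyperparameters are tuned. One common workaround is to fit on a detrended target and add the trend back after predicting. A sketch, assuming a hypothetical pandas Series sales of 110 monthly values with matching feature rows X and future feature rows X_future:

```python
import numpy as np
from xgboost import XGBRegressor

# A sketch, assuming `sales` is a pd.Series of 110 monthly values, and
# X / X_future hold the corresponding past and future feature rows.
t = np.arange(len(sales))
trend_coefs = np.polyfit(t, sales.values, deg=1)   # fit a linear trend
trend = np.polyval(trend_coefs, t)

model = XGBRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X, sales.values - trend)                 # learn residuals only

# Forecast: predict residuals for future rows, then add the
# extrapolated trend back on.
t_future = np.arange(len(sales), len(sales) + 12)
forecast = model.predict(X_future) + np.polyval(trend_coefs, t_future)
```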
Category: Data Science

Improving prediction accuracy with XGBoost

I have a 32x20 matrix on which I am trying to use XGBoost (regression). I am looping through rows to produce an out-of-sample forecast. I'm surprised that XGBoost returns an out-of-sample error (MAPE) of 3-4%: when I run the data through other algorithms (glmboost, boosted linear model), I get MAPEs around 1.8-2.5%. I'm surprised XGBoost is so deficient. I suspect I am under-optimizing the hyperparameters. I include the grid search I ran below, but the error …
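With only 32 rows, a single split makes MAPE very noisy, so repeated cross-validation is worth folding into the search. An illustrative grid-search sketch (the parameter ranges here are mine, not the poster's, and X, y are assumed to exist):

```python
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from xgboost import XGBRegressor

# An illustrative sketch, assuming X (32x20) and y exist. Repeated K-fold
# keeps the MAPE estimate from hinging on a single lucky split.
grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 4],          # shallow trees for tiny datasets
    'learning_rate': [0.05, 0.1, 0.3],
    'subsample': [0.7, 1.0],
}
search = GridSearchCV(
    XGBRegressor(),
    grid,
    scoring='neg_mean_absolute_percentage_error',
    cv=RepeatedKFold(n_splits=4, n_repeats=5, random_state=0),
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```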
Category: Data Science

Why is XGBClassifier in Python outputting different feature importance values with the same data across different repetitions?

I am fitting an XGBClassifier to a small dataset (32 subjects) and find that if I loop through the code 10 times, the feature importances (gain) assigned to the features in the model vary slightly between runs. I am using the same hyperparameter values in each iteration, and have subsample and colsample set to the default of 1 to prevent any random variation between executions. I am using the scikit-learn feature_importances_ attribute to extract the values from the fitted model. Any …
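Besides subsampling, run-to-run variation can come from the random seed and from multi-threaded histogram building. A minimal sketch pinning both down, with hypothetical X and y:

```python
from xgboost import XGBClassifier

# A minimal sketch, assuming X and y exist. random_state pins the RNG;
# n_jobs=1 removes variation from non-deterministic parallel reductions.
importances = []
for _ in range(10):
    model = XGBClassifier(random_state=42, n_jobs=1,
                          subsample=1.0, colsample_bytree=1.0)
    model.fit(X, y)
    importances.append(model.feature_importances_)

# All 10 rows should now be identical across repetitions.
```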
Category: Data Science

XGBClassifier's predictions are not probabilities with objective='binary:logistic'

I am using XGBoost's XGBClassifier with a binary 0-1 target, and I am trying to define a custom metric function. According to the XGBoost tutorials, it receives an array of predictions and a DMatrix with the training set. I have used objective='binary:logistic' in order to get probabilities, but the prediction values passed to the custom metric function are not between 0 and 1. They can be between, say, -3 and 5, and the range of values seems to grow …
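One common explanation is that the values handed to the callback are raw margin scores (log-odds) rather than transformed probabilities, in which case applying the sigmoid inside the metric recovers them. A sketch of a custom F1 metric under that assumption:

```python
import numpy as np

# A sketch: under this assumption the metric receives raw margins
# (log-odds), so apply the sigmoid before thresholding at 0.5.
def custom_f1(preds, dtrain):
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))   # margin -> probability
    pred_labels = (probs > 0.5).astype(int)
    tp = np.sum((pred_labels == 1) & (labels == 1))
    precision = tp / max(np.sum(pred_labels == 1), 1)
    recall = tp / max(np.sum(labels == 1), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return 'f1', f1
```

Such a function would be passed to xgb.train via the feval argument (custom_metric in newer versions).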
Category: Data Science

Using Transaction Amount to Guide Learning in a Fraud Detection Machine Learning Model

I am currently using transaction amount as a feature in an XGBoost classification model designed to identify fraudulent transactions. For this problem, transaction amount is bounded between 0 and 500. Using transaction amount as a feature does improve target class separability. However, I can't help but wonder if there is a better way to use this variable. To explain: I care more about getting the high transaction amounts correct than the low ones. However, the model …
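One option along these lines is to pass the transaction amount as a per-row sample weight, so that errors on high-value transactions cost more during training. A sketch, with hypothetical X, y and amount arrays:

```python
import numpy as np
from xgboost import XGBClassifier

# A sketch, assuming X, y (fraud labels) and `amount` (0..500) exist.
# Weighting by amount makes misclassifying a 500-unit transaction count
# far more than a 1-unit one in the training loss.
weights = np.clip(amount, 1, 500)   # floor at 1 to avoid zero-weight rows

model = XGBClassifier(objective='binary:logistic')
model.fit(X, y, sample_weight=weights)
```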
Category: Data Science

Multiple XGBoost models or just 1 for a certain type of category?

I am building a model to predict, say, house prices. Within my data I have sales and rentals, and the Y variable is the price of either the sale or the rental. I also have a number of X variables to predict Y, such as number of bedrooms, bathrooms, square meters, etc. I believe that the model will first split on the variable "sales" vs "rentals", as this would reduce the loss function (RMSE) the most. Do you …
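One way to settle it empirically is to benchmark the single combined model against two specialists, one per listing type. A sketch, assuming a hypothetical DataFrame df with a listing_type column:

```python
from xgboost import XGBRegressor

# A sketch with a hypothetical `df` containing a 'listing_type' column
# ('sale' or 'rental'), feature columns, and a 'price' target.
models = {}
for kind, group in df.groupby('listing_type'):
    X = group.drop(columns=['price', 'listing_type'])
    y = group['price']
    models[kind] = XGBRegressor().fit(X, y)

# Compare held-out RMSE of these two specialists against a single model
# trained on all rows with listing_type kept as a feature.
```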
Category: Data Science

Ignoring features in XGBoost by setting them as "missing"

I have some data, n x m, and I want to ignore certain features. One idea I had is to mark those features as "missing", since XGBoost can handle missing values by default, e.g. using nan when constructing the DMatrix:

```python
n, m = 100, 10
X = np.random.uniform(size=(n, m))
y = (np.sum(X, axis=1) >= 0.5 * m).astype(int)

# ignore certain features: mark them as missing
X[:, 2:7] = np.nan

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
model = xgb.train(params={'objective': 'binary:logistic'}, dtrain=dtrain)
```

My …
Category: Data Science

Interpretation of SHAP summary plot in a multi-class context

I'm performing multi-class classification and use SHAP values to interpret the features. I have 3 classes. I have tested XGBoost and multinomial logistic regression. When I'm using XGBoost, I am able to get a summary plot where I can see each feature's effect on all three classes. I'm also able to get a separate plot for each class to see how small/large feature values affect the prediction towards the individual class. It seems like this is only possible to …
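For reference, a sketch of how those per-class plots are typically produced with the shap package, assuming a fitted multi-class XGBoost model and a feature matrix X:

```python
import shap

# A sketch, assuming `model` is a fitted multi-class XGBoost model and
# `X` is the feature matrix. For multi-class models, shap_values is a
# list with one array of SHAP values per class.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Combined summary across all classes (bar plot of mean |SHAP| per class):
shap.summary_plot(shap_values, X)

# Per-class view, e.g. class 0:
shap.summary_plot(shap_values[0], X)
```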
Category: Data Science

Is multicollinearity a problem when interpreting SHAP values from an XGBoost model?

I'm using an XGBoost model for multi-class classification and am looking at feature importance using SHAP values. I'm curious whether multicollinearity is a problem for the interpretation of the SHAP values. As far as I know, XGBoost is not affected by multicollinearity, so I assume SHAP won't be affected either?
Category: Data Science

Distribution of predicted probability is heavily skewed

I'm a beginner here. I'm trying to use an xgboost method for a classification learning problem. My data is 70-30 unbalanced, but I ran into a problem: the distribution of the predicted probabilities is heavily skewed, as in the picture below. I need advice on how to solve this.
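If the skew turns out to be a calibration problem rather than a modeling bug, one standard remedy is probability calibration. A sketch with hypothetical X and y:

```python
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

# A sketch, assuming X and y exist. Isotonic calibration reshapes the
# model's scores so predicted probabilities match observed frequencies.
calibrated = CalibratedClassifierCV(XGBClassifier(), method='isotonic', cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
```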
Category: Data Science

Making a fitted xgboost or linear regression model predict values in the future

I have a machine learning model that I fitted with xgboost and linear regression. My dataset has thirteen features and price as the target. Is there any way to make the model predict future values? I have datetime as one of the variables. From searching the internet, I learned about fb prophet, and that this is a time series problem. But if my xgboost is doing well, is there a way to make it predict future …
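A common pattern is to turn the datetime into calendar features plus lagged targets, then forecast recursively, feeding each prediction back in as the next lag. A sketch, assuming a hypothetical DataFrame df with a datetime64 'date' column and a 'price' target:

```python
import pandas as pd
from xgboost import XGBRegressor

# A sketch, assuming `df` has a datetime64 'date' column and a 'price'
# target. Trees can't use raw timestamps, so derive calendar features
# and a lagged target instead.
df = df.sort_values('date')
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['price_lag1'] = df['price'].shift(1)

train = df.dropna()
features = ['month', 'year', 'price_lag1']
model = XGBRegressor().fit(train[features], train['price'])

# Recursive forecast: each prediction becomes the next step's lag.
last_price = df['price'].iloc[-1]
future = pd.date_range(df['date'].max() + pd.offsets.MonthBegin(1),
                       periods=12, freq='MS')
for date in future:
    row = pd.DataFrame({'month': [date.month], 'year': [date.year],
                        'price_lag1': [last_price]})
    last_price = float(model.predict(row[features])[0])
```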
Category: Data Science

Correct theoretical regularized objective function for XGB/LGBM (regression task)

I am writing an academic paper on the application of machine learning methods to time series forecasting, and I am unsure how to write down the theoretical part about the regularized objective function for XGBoost. Below you can find the equation given by the developers of the XGBoost algorithm for the regularized objective (equation 2). The paper is called "XGBoost: A Scalable Tree Boosting System" by Chen & Guestrin (2016). In the Python API from the xgb library …
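For reference, equation (2) of Chen & Guestrin (2016) reads:

$$
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
$$

where $l$ is a differentiable convex loss between the prediction $\hat{y}_i$ and the target $y_i$, $f_k$ is the $k$-th tree, $T$ is the number of leaves of a tree, and $w$ is its vector of leaf weights. In the Python API, $\gamma$ and $\lambda$ correspond to the gamma and reg_lambda parameters.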
Category: Data Science

XGBoost with categorical data in both the target column and the features

I have a huge dataset with categorical columns among the features, and my target variable is also categorical. The values are not ordinal, so I think it is best to use one-hot encoding. But I have one issue: my target variable has 90 classes, so if I one-hot encode it there will be 90 target columns, which becomes too complex. But since the values are not ordinal, can I …
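Worth noting for the target side: XGBoost's multi-class objectives expect a single integer-encoded label column (0 to 89), not a one-hot matrix, so only the features need one-hot encoding. A sketch with a hypothetical df and 'label' target column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# A sketch, assuming `df` has categorical feature columns and a 90-class
# categorical target 'label'. The target is integer-encoded (0..89),
# not one-hot: multi-class objectives take a single column of class indices.
le = LabelEncoder()
y = le.fit_transform(df['label'])

# One-hot encode only the (non-ordinal) features.
X = pd.get_dummies(df.drop(columns='label'))

# The sklearn wrapper infers the number of classes from y.
model = XGBClassifier(objective='multi:softprob')
model.fit(X, y)
```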
Category: Data Science
