I am fitting an XGBClassifier to a small dataset (32 subjects) and find that if I loop through the code 10 times, the feature importances (gain) assigned to the features in the model vary slightly. I am using the same hyperparameter values in each iteration, and have subsample and colsample set to the default of 1 to prevent any random variation between executions. I am using the scikit-learn-style feature_importances_ attribute to extract the values from the fitted model. Any …
What is the easiest and most easily explained feature importance calculation for linear regression? I know I can use SHAP to compute feature importance, but I find it difficult to explain to stakeholders, and the raw coefficient is not a good measure of feature importance since it depends on the scale of the feature. Some have suggested (standard deviation of feature) × (feature coefficient) as a good measure of feature importance.
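The suggested measure takes only a couple of lines: importance_j = |β_j| · sd(x_j), i.e. how much the prediction moves for a typical (one standard deviation) change in the feature. A minimal sketch with sklearn on synthetic data (all names illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = LinearRegression().fit(X, y)

# |coefficient| * sd(feature): puts all features on a comparable scale,
# and is easy to narrate as "effect of a typical change in this feature".
importance = np.abs(model.coef_) * X.std(axis=0)
for j, imp in enumerate(importance):
    print(f"feature {j}: {imp:.3f}")
```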
I noticed that I am getting different feature importance results with each random forest run even though they use the same parameters. Now, I know that a random forest model samples observations randomly, which causes the importance levels to vary. This is especially evident for the less important variables. My question is how one should interpret the variance in random forest results when running it multiple times. I know that one can reduce the instability level of results …
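One way to make the run-to-run variance interpretable is to measure it directly: fit the same forest under several seeds and report each feature's mean importance together with its standard deviation, so unstable rankings are visible at a glance. A minimal sketch on synthetic data (settings illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

# Fit the same forest under 10 different seeds and collect the importances.
runs = np.array([
    RandomForestClassifier(n_estimators=100, random_state=seed)
    .fit(X, y).feature_importances_
    for seed in range(10)
])

mean_imp, sd_imp = runs.mean(axis=0), runs.std(axis=0)
for j in range(X.shape[1]):
    print(f"feature {j}: {mean_imp[j]:.3f} +/- {sd_imp[j]:.3f}")
```

Features whose standard deviation is comparable to their mean have no stable rank and are best treated as interchangeable noise.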
I'm using fastai to train a network on tabular data (https://docs.fast.ai/tutorial.tabular.html). I have a table with 5 columns, each of which is a specific attribute that describes a galaxy and helps to classify it into one of two types: elliptical and spiral. My question is: is it possible to get a measure of which of these attributes is most helpful/least helpful for the training? I mean some kind of ranking.
A random forest model outputs the following importance values. How do I interpret them for feature selection? If it's the mean decrease in accuracy, does that mean that removing those features from the model should increase the accuracy?
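For comparison, sklearn's permutation_importance makes the sign convention explicit: a positive value means that shuffling the feature decreased accuracy, i.e. the model relies on it; it does not imply that dropping the feature would raise accuracy. A minimal sketch on synthetic data (settings illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# importances_mean > 0: breaking the feature/target link hurt accuracy.
result = permutation_importance(rf, X, y, scoring="accuracy",
                                n_repeats=10, random_state=0)
for j, (m, s) in enumerate(zip(result.importances_mean,
                               result.importances_std)):
    print(f"feature {j}: {m:.3f} +/- {s:.3f}")
```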
I ran into this problem: an XGBoost model (.pickle file, constructed under v0.7.post3) with 100 features in it. But I found that 55 features in the model (model.feature_importances_) show 0 feature importance (like the matrix below). Additionally, when I transformed the pickle file to PMML (to launch online), only 45 features were in the PMML file (those with importance > 0, apparently). So, my questions are: why do features with importance equal to 0 end up in an XGBoost model, and why do they remain in the …
I have a machine learning problem with about 160 features and 400 cases, and I want to find the best predictors for a continuous outcome. The dataset contains variables of psychotherapists and clients. I want to predict therapy outcome. I used lasso regression with nested 20-fold cross-validation and could identify about 20 top predictors (model fit of about 0.97 NRMSE). (I decided not to create a separate holdout dataset because I have too few cases.) However, I thought I could improve …
I am creating a linear regression model for energy usage in a food processing plant. Unfortunately, I don't have the historical data for one of the critical features (I know it is important from experience). If I go ahead with the modelling excluding this feature, what will be the impact on my model performance, and especially on the feature importance? Can I trust the feature importance in the absence of this feature, or would the model over-attribute the importance …
I have run an XGBClassifier using the following fields: - predictive features = ['sbp','tobacco','ldl','adiposity','obesity','alcohol','age'] - binary target = 'Target' I have produced the following feature importance plot: I understand that, generally speaking, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions within the decision trees, the higher its relative importance. From the list of 7 predictive …
I have more of a conceptual question I was hoping to get some feedback on. I am trying to run a boosted regression ML model to identify a subset of important predictors for some clinical condition. The dataset includes over 100000 rows, and close to 1000 predictors. Now, the etiology of the disease we are trying to predict is largely unknown. Thus, we likely don’t have data on many important predictors for the condition. That is to say, as a …
How do you ascertain which variables lead to the greatest increase in another variable of interest? Let's say you have a correlation matrix. You look at the row of the variable you are particularly curious about, retention, and see that income is the most correlated with it out of all the variables in the matrix. I would then expect, when I look at the highest-income cities in my dataset, to see them having the highest retention, but am not finding …
I am training an XGBoost model, xgbr, using xgb.XGBRegressor() with 13 features and one numeric target. The R² on the test set is 0.935, which is good. I am checking the feature importance with for col, score in zip(X_train.columns, xgbr.feature_importances_): print(col, score). When I check the importance type via xgbr.importance_type, the result is gain. I have a feature, x1, whose importance is 0.0068, which is not very high. x1 is a categorical feature with a cardinality of 5122, and I apply LabelEncoder before …
For neural network feature importance, can I zero out all features except one in order to gauge that feature's importance? I know shuffling a feature is one approach. For example, leaving in the 4th feature:

feature_4 = [
    [0., 0., 0., 1.15, 0.],
    [0., 0., 0., 1.76, 0.],
    [0., 0., 0., 2.31, 0.],
    [0., 0., 0., 0.94, 0.],
]
_, probabilities = model.predict(feature_4)

The non-linear output of activation functions worries me, because the activation of the whole is not equal to the sum of the individual activations:

from scipy.special import expit  # aka sigmoid
>>> expit(2.0)
0.8807970779778823
>>> expit(1.0) + expit(1.0)
1.4621171572600098
…
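The shuffling approach mentioned above can be written framework-agnostically, so the same helper works with any model's predict function. A minimal sketch, assuming a higher-is-better metric; shuffle_importance and the toy predict below are illustrative stand-ins, not an existing API:

```python
import numpy as np

def shuffle_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Permutation importance: importance_j = baseline score minus the
    score after shuffling column j (assumes higher metric = better)."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])   # break only this feature's link to y
            drops.append(baseline - metric(y, predict(Xp)))
        imp[j] = np.mean(drops)
    return imp

# Toy check: y depends only on column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
predict = lambda X: (X[:, 0] > 0).astype(int)   # stand-in for model.predict
accuracy = lambda y_true, y_pred: np.mean(y_true == y_pred)

imp = shuffle_importance(predict, X, y, accuracy)
print(imp)
```

Unlike zeroing out features, shuffling keeps each input on its training distribution, which sidesteps the non-additivity concern above.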
I am wondering if there is a way to check the feature importance for each class in a binary classification task separately. Or any way to check the correlation between features and both target classes separately?
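For a binary target, one quick check is to correlate each feature with the indicator of each class separately; a minimal sketch on synthetic data (all names illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Correlate each feature with the indicator of each class separately.
corrs = {}
for cls in (0, 1):
    indicator = (y == cls).astype(float)
    corrs[cls] = np.array([np.corrcoef(X[:, j], indicator)[0, 1]
                           for j in range(X.shape[1])])
    print(cls, np.round(corrs[cls], 3))
```

With only two classes the two vectors are equal in magnitude and opposite in sign, so a single ranking covers both; a genuinely separate per-class importance only becomes meaningful in a multiclass one-vs-rest setup.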
I want to audit the results of regressions I ran, and hopefully gain more insight into a treatment effect, through sklearn's permutation_importance function or eli5's PermutationImportance. I know that these are generally used to narrow down the number of predictors in a model in an attempt to increase its accuracy (feature selection). My specific problem is that I do not want to use FI for feature selection, but for direct interpretation of the importance of the variables in …
We have a data table that accumulates the control and monitoring parameters of the High-Temperature Superconductor (HTS) production process, such that the rows represent the observations and the columns represent the parameters mentioned above. Due to the nature of the production process, there are time dependencies between the rows of our data sets. Thus the columns are, indeed, time series. (Which boils our data down to time-dependent data.) Now the question arises whether we can apply induced causation methods, explained in …
An important aspect of tuning a model is assessing feature importance. In Keras, how can one assess the importance of a categorical feature that is one-hot encoded? E.g. if a categorical feature is ice_cream_colour with a cardinality of 12, then I can assess the individual importances of ice_cream_colour_blue, ice_cream_colour_red, etc., but how to do it for the entire ice_cream_colour feature? A naïve approach would be to sum all the individual importances, but this assumes that the relationship between the distinct feature importances is …
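An alternative to summing is to permute the whole one-hot block at once, using the same row permutation for every column so the block stays a valid encoding of a single category; the resulting drop in score is then the importance of the original categorical feature. A minimal sketch (grouped_permutation_importance and the toy model are illustrative names, not an existing API):

```python
import numpy as np

def grouped_permutation_importance(predict, X, y, metric, groups,
                                   n_repeats=5, seed=0):
    """Permute all columns of a group with the SAME row permutation so a
    one-hot block remains a valid encoding of one categorical value."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    out = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            perm = rng.permutation(len(X))
            Xp[:, cols] = X[perm][:, cols]   # move the whole block together
            drops.append(baseline - metric(y, predict(Xp)))
        out[name] = float(np.mean(drops))
    return out

# Toy data: a 3-level categorical (one-hot in columns 0-2) plus one
# noise column; the target is exactly the category.
rng = np.random.default_rng(1)
colour = rng.integers(0, 3, size=400)
X = np.hstack([np.eye(3)[colour], rng.normal(size=(400, 1))])
y = colour
predict = lambda X: X[:, :3].argmax(axis=1)   # stand-in model
acc = lambda yt, yp: np.mean(yt == yp)

imp = grouped_permutation_importance(predict, X, y, acc,
                                     groups={"colour": [0, 1, 2],
                                             "noise": [3]})
print(imp)
```

This gives one importance per original categorical feature without assuming the per-dummy importances are additive.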
I have a website and have info from Google Analytics, so I can see the following "features":
- page URL
- country
- device category (phone, desktop, etc.)
- number of sessions
- number of users: users who have initiated at least one session during the date range
- avg. time on page
- page views
- bounce rate: a probability calculated as single-page sessions divided by all sessions, or the percentage of all sessions on your site in which users viewed only a single page (e.g. …