I noticed that I am getting different feature importance results with each random forest run, even though the runs use the same parameters. Now, I know that a random forest samples observations randomly, which causes the importance levels to vary; this is especially noticeable for the less important variables. My question is: how does one interpret the variance in random forest results when running the model multiple times? I know that one can reduce the instability level of results …
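For illustration, a minimal sketch (on synthetic data, not the original setup) of quantifying this run-to-run variance by refitting the forest under several seeds and looking at the spread of each feature's importance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with the real X, y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

importances = []
for seed in range(20):
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    importances.append(rf.feature_importances_)

importances = np.array(importances)
print("mean importance per feature:", importances.mean(axis=0))
print("std across runs:            ", importances.std(axis=0))
```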
I've developed a text classifier in the form of a Python function that takes an np.array of strings (each string is one observation): def model(vector_of_strings): ... # do something return vec_of_probabilities # like [0.1, 0.23, ..., 0.09] When I try to use KernelExplainer from the shap package like this: test_texts = pd.Series(['text1','text2','text3']) shap.KernelExplainer(model, test_texts) I receive the following error: AttributeError: 'numpy.ndarray' object has no attribute 'find' What can I do about it?
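For context, a minimal sketch of the commonly suggested workaround (an assumption, not necessarily the asker's pipeline): KernelExplainer works on numeric arrays, so the model is explained on a vectorized representation of the text rather than on the raw strings; the vectorizer, classifier and texts below are placeholders.

```python
import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder texts and labels.
texts = np.array(["good product", "bad service", "great value", "terrible quality"])
labels = np.array([1, 0, 1, 0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
clf = LogisticRegression().fit(X, labels)

def model_numeric(X_numeric):
    # The wrapped model receives the numeric matrix, not raw strings.
    return clf.predict_proba(X_numeric)[:, 1]

explainer = shap.KernelExplainer(model_numeric, X)
shap_values = explainer.shap_values(X[:2])
print(shap_values.shape)  # one SHAP value per TF-IDF feature per explained text
```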
Thank you very much for your time. I would like to compare the contribution of a predictor variable (X1) across different BRT (boosted regression tree) models fitted at different spatial scales (radius = 250 m, 160 m, 80 m, 40 m, 10 m). But the contribution reported by BRT is a relative contribution; I want to know each variable's absolute contribution (absolute importance), so that I can compare them across models. I used the formula $R^2 = 1 - \text{(residual sum of squares)}/\text{(total sum of squares)}$ to calculate the R-squared (total …
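A rough sketch of one reading of this idea (not a standard BRT output; the numbers are hypothetical): scale each variable's relative contribution by the model's $R^2$ so that contributions can be compared across the different radii.

```python
# Hypothetical relative contributions (as fractions) from one BRT model
# and that model's R^2; both would come from the fitted model in practice.
relative_contribution = {"X1": 0.32, "X2": 0.20}
r_squared = 0.55

absolute_contribution = {var: rel * r_squared
                         for var, rel in relative_contribution.items()}
print(absolute_contribution)  # e.g. X1 accounts for ~0.176 of the total variance
```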
I was wondering if there exist techniques to cluster data according to a target. For example, suppose we want to find groups of customers likely to churn: the target is churn, and we want to find clusters exhibiting the same behaviour with respect to whether they are likely to churn (or not). Therefore, variables that do not explain churn behaviour should not influence how the clusters are built. I have done this analysis in the following way: predict the target (e.g. using a Random Forest) and …
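A possible sketch of the kind of workflow being described (synthetic data; the feature-selection step is one plausible reading of the truncated description): fit a supervised model on churn, keep only the churn-relevant features, and cluster on those.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for customer features and the churn label.
X, y = make_classification(n_samples=400, n_features=12, n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[-4:]        # keep the churn-relevant columns

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, top])
print(np.bincount(clusters))                          # cluster sizes
```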
I think this is a very basic question, so sorry for the wordy format; I am trying to get my head around it. I am thinking about predicting earthquake damage to property in the US using a GLM. I start with my predictor data, say: State (categorical), Owner wealth bracket (discrete), Bedrooms in property (discrete), Earthquake resistance number (discrete), and my response variable: Claim amount in a given year (continuous). I may decide at the beginning to split my …
This model-agnostic method is not well addressed in research papers. I have read articles where it was used to test the accuracy of models, trying to understand the importance of individual features and their contribution to the model. I saw values ranging from negative numbers to 10 or even more. I am wondering what the expected values from such a method would be, and which considerations should be made. I would expect that, after extracting many features from the data and building …
Let's say we have a categorical feature $X_i$ and we have built a black-box classification model, like xgboost, with $X_i$ as one of many predictors. We'd like to ask: does $X_i$ affect the overall prediction and, if so, by how much? In particular, $X_i$ could be: a dichotomous variable; an $n$-level variable where we are interested in the potential difference between two particular levels. In white-box models like linear regression we have tests to obtain statistical significance. But can …
Let's say that we have $$f(x,y,z) = \frac{x}{k} - \frac{y}{k}\,\frac{z - x/k}{z - y/k}, \qquad k \text{ constant},\; k \in \left]0,1\right[.$$ I need to show in some way that the variable $x$ is more important, according to some metric, though I don't know which metric would be a good choice. I thought about analysing the partial derivatives of the function, but I don't think that is a good way, because one would only see some restricted path through the surface. Another approach would …
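As a small illustration of the partial-derivative idea mentioned (a sketch of the computation, not a recommendation that this is the right metric), the local sensitivities can be obtained symbolically:

```python
import sympy as sp

x, y, z, k = sp.symbols("x y z k", positive=True)
f = x/k - (y/k) * ((z - x/k) / (z - y/k))

# Local sensitivities of f with respect to x and y.
dfdx = sp.simplify(sp.diff(f, x))
dfdy = sp.simplify(sp.diff(f, y))
print(dfdx)
print(dfdy)
```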
I have built an XGBoost classification model in Python on an imbalanced dataset (~1 million positive and ~12 million negative observations), where the features are binary user interactions with web page elements (e.g. did the user scroll to reviews or not) and the target is a binary retail action. My ultimate goal was not so much to achieve a model with optimal decision-rule performance as to understand which user actions/features are important in determining the positive retail …
A couple of questions on the SHAP approach to estimating feature importance. I would like to use random forest, logistic regression, SVM, and kNN to train four classification models on a dataset. Parameters in each training run are chosen to give the best accuracy and precision for every model. A feature has a different magnitude of SHAP values in every model. Are these differences meaningful, i.e. does the feature indeed have a different importance depending on the algorithm (RF …
I have a model (GBDT) where adding a feature X on its own does not make it important (according to SHAP), but when I add other features and then add X again, feature X becomes the second most important! What could explain that? How do I investigate what is going on?
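One way to start investigating (a sketch on synthetic data, with xgboost standing in for the GBDT and the feature index as a placeholder): SHAP interaction values split a feature's contribution into a main effect and interactions with the other features, which is the kind of effect that can make X important only once its interaction partners are present.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in data; j below is the index of the feature of interest.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = xgb.XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

inter = shap.TreeExplainer(model).shap_interaction_values(X)  # (n_samples, n_feat, n_feat)
j = 0
main_effect = np.abs(inter[:, j, j]).mean()
interaction_effect = np.abs(inter[:, j, :]).mean(axis=0).sum() - main_effect
print(main_effect, interaction_effect)
```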
I was working on a small classification problem (the breast cancer data set from sklearn) and trying to decide which features were most important for predicting the labels. I understand that there are several ways to define "important feature" here (permutation importance, importance in trees, ...), but I did the following: 1) rank the features by coefficient value in a logistic regression; 2) rank the features by "feature importance" from a random forest. These don't quite tell the same story, …
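For reference, a minimal sketch of the comparison described, assuming standardized inputs so that logistic-regression coefficients are comparable in magnitude:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # standardize so coefficients are comparable
y = data.target

logreg = LogisticRegression(max_iter=5000).fit(X, y)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

ranking = pd.DataFrame({
    "feature": data.feature_names,
    "abs_coef": abs(logreg.coef_[0]),
    "rf_importance": rf.feature_importances_,
})
print(ranking.sort_values("abs_coef", ascending=False).head(10))
print(ranking.sort_values("rf_importance", ascending=False).head(10))
```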
Suppose I have a set of M categorical variables, some of them with different numbers of categories (for instance, var1 has five categories, var2 has three, etc.). I train an XGBoost model on a numeric target Y after having performed one-hot encoding on the M categorical variables, thus creating a set of dummy inputs. When looking at the model results, I get a table of importance gain for the categories of each feature, i.e. how important they are in …
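A small sketch of one way to aggregate per-dummy gains back to the original variables, assuming the dummy columns are named "variable_category"; the gain dictionary below is hypothetical, of the kind returned by the booster's get_score(importance_type="gain").

```python
# Hypothetical per-dummy gains after one-hot encoding.
gain = {"var1_A": 12.0, "var1_B": 3.5, "var2_low": 7.1, "var2_high": 0.9}

per_variable = {}
for dummy, g in gain.items():
    original = dummy.rsplit("_", 1)[0]   # strip the category suffix
    per_variable[original] = per_variable.get(original, 0.0) + g

print(per_variable)  # {'var1': 15.5, 'var2': 8.0}
```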
Should SHAP value analysis be done on the training set or the test set? What does it mean if the feature importance based on mean |SHAP value| differs between the training and test sets of my lightgbm model? I intend to use SHAP analysis to identify how each feature contributes to each individual prediction, and possibly to identify individual predictions that are anomalous: for instance, if an individual prediction's top (+/-) contributing features are vastly different from those of the model's feature …
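A minimal sketch of the comparison in question, on synthetic data with a LightGBM regressor standing in for the actual model:

```python
import numpy as np
import shap
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model.
X, y = make_regression(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LGBMRegressor(n_estimators=200).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
mean_abs_train = np.abs(explainer.shap_values(X_train)).mean(axis=0)
mean_abs_test = np.abs(explainer.shap_values(X_test)).mean(axis=0)

print(np.argsort(-mean_abs_train))  # feature ranking on the training set
print(np.argsort(-mean_abs_test))   # feature ranking on the test set
```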
I'm currently working with a dataset that has been collected over several years, and I suspect the predictive power of my predictor variables is changing over time. I could go back year by year and run the data the same way each time to see how effective each predictor is, then trend the predictive power over time manually, but there has to be a better way. Can anyone point me towards the technique I should read up on?
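One possible automation of the year-by-year idea (a sketch; the column names and model choice are placeholders): refit the same model per year and record permutation importance, giving a table whose trend can then be inspected.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def importance_by_year(df, feature_cols, target_col, year_col):
    """Refit the same model on each year's data and record permutation importance."""
    rows = []
    for year, chunk in df.groupby(year_col):
        X, y = chunk[feature_cols], chunk[target_col]
        model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
        imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
        rows.append(pd.Series(imp.importances_mean, index=feature_cols, name=year))
    return pd.DataFrame(rows)  # one row per year, one column per predictor
```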
Is it possible to use feature importance from random forests (e.g. based on Gini impurity) or other models to determine which features I can use to group the rows of my dataset homogeneously? For example, let's say I have a dataset with N rows and p columns (one of the columns is used as the label in my training task). I train the model and get a ranking of the importance of my features. Only 5 features are more important …
In classification, when we want to get the importance of each variable in the random forest algorithm, we usually use the Mean Decrease in Gini or Mean Decrease in Accuracy metrics. Is there a metric that computes the positive or negative effect of each variable, not on the predictive accuracy of the model, but on the dependent variable itself? Something like the beta coefficients in a standard linear regression model, but in the context of classification with random forests.
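One commonly suggested complement (a sketch on synthetic data, not the only option): partial dependence shows whether increasing a variable pushes the predicted probability up or down, which is the closest random forest analogue to the sign of a beta coefficient.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

pdp = partial_dependence(rf, X, features=[0])  # average effect of feature 0 on P(class 1)
print(pdp["average"][0])                       # an increasing curve => positive effect
```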
I want to determine predictor importance. The ideal would be to re-train the same model on the same dataset with each variable removed in turn, but this is too time consuming. The recommendation I have seen everywhere is to "remove" a column by converting it into noise, i.e. replacing it with a permutation of itself. Why is it not better to replace the variable with a constant, thus "muting" the signal? I ran an experiment on my own natural dataset, with highly cross-correlated variables removed. Variable importance was …
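For illustration, a small sketch of the experiment described, comparing the importance obtained by permuting a column with that obtained by replacing it with its mean, on a toy dataset (not the asker's data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
baseline = r2_score(y, model.predict(X))

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # "remove" the column by permuting it
    X_const = X.copy()
    X_const[:, j] = X_const[:, j].mean()           # "mute" the column with a constant
    print(j,
          baseline - r2_score(y, model.predict(X_perm)),
          baseline - r2_score(y, model.predict(X_const)))
```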
I am working on a problem where I need to classify phrases into one of two categories (let's call them A and B). I used the Keras SepCNN model (similar to this) for that, and it is giving me some results. Now I want to analyse the predictions, and more specifically I want to know why the model classified a certain phrase into category A or B, i.e. which set of features played an important role in labeling that phrase as category …
I am currently working on a research project where the central question is which features drive the predictions of different models. The main issue is that there is high (multi-)collinearity among those features. Imagine a setting with about 200 different features that are all potential candidates for helping to predict the same dependent variable. In the past, the relevance of small subsets of, say, 5 of these features has simply been analyzed by throwing them into a linear regression model and …