I noticed that I am getting different feature importance results with each random forest run, even though the runs use the same parameters. Now, I know that a random forest samples observations randomly, which causes the importance levels to vary; this is especially noticeable for the less important variables. My question is: how does one interpret the variance in random forest results when running the model multiple times? I know that one can reduce the instability level of results …
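For illustration, a minimal sketch (on synthetic data, not the original setup) of quantifying this run-to-run variance by refitting the forest under several seeds and looking at the spread of each feature's importance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with the real X, y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

importances = []
for seed in range(20):
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    importances.append(rf.feature_importances_)

importances = np.array(importances)
print("mean importance per feature:", importances.mean(axis=0))
print("std across runs:            ", importances.std(axis=0))
```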
I've developed a text classifier in the form of a Python function that takes an np.array of strings (each string is one observation): def model(vector_of_strings): ... # do something return vec_of_probabilities # like [0.1, 0.23, ..., 0.09] When I try to use KernelExplainer from the shap package like this: test_texts = pd.Series(['text1','text2','text3']) shap.KernelExplainer(model, test_texts) I receive the following error: AttributeError: 'numpy.ndarray' object has no attribute 'find' What can I do about it?
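For context, a minimal sketch of the commonly suggested workaround (an assumption, not necessarily the asker's pipeline): KernelExplainer works on numeric arrays, so the model is explained on a vectorized representation of the text rather than on the raw strings; the vectorizer, classifier and texts below are placeholders.

```python
import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder texts and labels.
texts = np.array(["good product", "bad service", "great value", "terrible quality"])
labels = np.array([1, 0, 1, 0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
clf = LogisticRegression().fit(X, labels)

def model_numeric(X_numeric):
    # The wrapped model receives the numeric matrix, not raw strings.
    return clf.predict_proba(X_numeric)[:, 1]

explainer = shap.KernelExplainer(model_numeric, X)
shap_values = explainer.shap_values(X[:2])
print(shap_values.shape)  # one SHAP value per TF-IDF feature per explained text
```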
Thank you very much for your time. I would like to compare the contribution of a predictor variable (X1) across different BRT (boosted regression tree) models fitted at different spatial scales (radius = 250 m, 160 m, 80 m, 40 m, 10 m). But the contribution reported by BRT is a relative contribution; I want to know each variable's absolute contribution (absolute importance), so that I can compare them across models. I used the formula $R^2 = 1 - \text{(residual sum of squares)}/\text{(total sum of squares)}$ to calculate the R-squared (total …
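A rough sketch of one reading of this idea (not a standard BRT output; the numbers are hypothetical): scale each variable's relative contribution by the model's $R^2$ so that contributions can be compared across the different radii.

```python
# Hypothetical relative contributions (as fractions) from one BRT model
# and that model's R^2; both would come from the fitted model in practice.
relative_contribution = {"X1": 0.32, "X2": 0.20}
r_squared = 0.55

absolute_contribution = {var: rel * r_squared
                         for var, rel in relative_contribution.items()}
print(absolute_contribution)  # e.g. X1 accounts for ~0.176 of the total variance
```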
I was wondering if there exist techniques to cluster data according to a target. For example, suppose we want to find groups of customers likely to churn: the target is churn, and we want to find clusters exhibiting the same behaviour with respect to whether they are likely to churn (or not). Therefore, variables that do not explain churn behaviour should not influence how the clusters are built. I have done this analysis in the following way: predict the target (e.g. using a Random Forest) and …
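A possible sketch of the kind of workflow being described (synthetic data; the feature-selection step is one plausible reading of the truncated description): fit a supervised model on churn, keep only the churn-relevant features, and cluster on those.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for customer features and the churn label.
X, y = make_classification(n_samples=400, n_features=12, n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[-4:]        # keep the churn-relevant columns

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, top])
print(np.bincount(clusters))                          # cluster sizes
```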
I think this is a very basic question, so sorry for the wordy format; I am trying to get my head around it. I am thinking about predicting earthquake damage to property in the US using a GLM. I start with my predictor data, say: State (categorical), Owner wealth bracket (discrete), Bedrooms in property (discrete), Earthquake resistance number (discrete), and my response variable: Claim amount in a given year (continuous). I may decide at the beginning to split my …
This model-agnostic method is not well addressed in research papers. I have read articles where it was used to test the accuracy of models, trying to understand the importance of individual features and their contribution to the model. I saw values ranging from negative numbers to 10 or even more. I am wondering what the expected values from such a method would be, and which considerations should be made. I would expect that, after extracting many features from the data and building …
Let's say we have a categorical feature $X_i$ and we have built a black-box classification model, like xgboost, with $X_i$ as one of many predictors. We'd like to ask: does $X_i$ affect the overall prediction and, if so, by how much? In particular, $X_i$ could be: a dichotomous variable; an $n$-level variable where we are interested in the potential difference between two particular levels. In white-box models like linear regression we have tests to obtain statistical significance. But can …
Let's say that we have $$f(x,y,z) = \frac{x}{k} - \frac{y}{k}\,\frac{z - x/k}{z - y/k}, \qquad k \text{ constant},\; k \in \left]0,1\right[.$$ I need to show in some way that the variable $x$ is more important, according to some metric, though I don't know which metric would be a good choice. I thought about analysing the partial derivatives of the function, but I don't think that is a good way, because one would only see some restricted path through the surface. Another approach would …
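As a small illustration of the partial-derivative idea mentioned (a sketch of the computation, not a recommendation that this is the right metric), the local sensitivities can be obtained symbolically:

```python
import sympy as sp

x, y, z, k = sp.symbols("x y z k", positive=True)
f = x/k - (y/k) * ((z - x/k) / (z - y/k))

# Local sensitivities of f with respect to x and y.
dfdx = sp.simplify(sp.diff(f, x))
dfdy = sp.simplify(sp.diff(f, y))
print(dfdx)
print(dfdy)
```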
I have built an XGBoost classification model in Python on an imbalanced dataset (~1 million positive and ~12 million negative observations), where the features are binary user interactions with web page elements (e.g. did the user scroll to reviews or not) and the target is a binary retail action. My ultimate goal was not so much to achieve a model with optimal decision-rule performance as to understand which user actions/features are important in determining the positive retail …
A couple of questions on the SHAP approach to estimating feature importance. I would like to use random forest, logistic regression, SVM, and kNN to train four classification models on a dataset. Parameters in each training run are chosen to give the best accuracy and precision for every model. A feature has a different magnitude of SHAP values in every model. Are these differences meaningful, i.e. does the feature indeed have a different importance depending on the algorithm (RF …
I have a model (GBDT) where adding a feature X on its own does not make it important (according to SHAP), but when I add other features and then add X again, feature X becomes the second most important! What could explain that? How do I investigate what is going on?
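One way to start investigating (a sketch on synthetic data, with xgboost standing in for the GBDT and the feature index as a placeholder): SHAP interaction values split a feature's contribution into a main effect and interactions with the other features, which is the kind of effect that can make X important only once its interaction partners are present.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in data; j below is the index of the feature of interest.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = xgb.XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

inter = shap.TreeExplainer(model).shap_interaction_values(X)  # (n_samples, n_feat, n_feat)
j = 0
main_effect = np.abs(inter[:, j, j]).mean()
interaction_effect = np.abs(inter[:, j, :]).mean(axis=0).sum() - main_effect
print(main_effect, interaction_effect)
```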
I was working on a small classification problem (the breast cancer data set from sklearn) and trying to decide which features were most important for predicting the labels. I understand that there are several ways to define "important feature" here (permutation importance, importance in trees, ...), but I did the following: 1) rank the features by coefficient value in a logistic regression; 2) rank the features by "feature importance" from a random forest. These don't quite tell the same story, …
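For reference, a minimal sketch of the comparison described, assuming standardized inputs so that logistic-regression coefficients are comparable in magnitude:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # standardize so coefficients are comparable
y = data.target

logreg = LogisticRegression(max_iter=5000).fit(X, y)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

ranking = pd.DataFrame({
    "feature": data.feature_names,
    "abs_coef": abs(logreg.coef_[0]),
    "rf_importance": rf.feature_importances_,
})
print(ranking.sort_values("abs_coef", ascending=False).head(10))
print(ranking.sort_values("rf_importance", ascending=False).head(10))
```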
Suppose I have a set of M categorical variables, some of them with different numbers of categories (for instance, var1 has five categories, var2 has three, etc.). I train an XGBoost model on a numeric target Y after having performed one-hot encoding on the M categorical variables, thus creating a set of dummy inputs. When looking at the model results, I get a table of importance gain for the categories of each feature, i.e. how important they are in …
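A small sketch of one way to aggregate per-dummy gains back to the original variables, assuming the dummy columns are named "variable_category"; the gain dictionary below is hypothetical, of the kind returned by the booster's get_score(importance_type="gain").

```python
# Hypothetical per-dummy gains after one-hot encoding.
gain = {"var1_A": 12.0, "var1_B": 3.5, "var2_low": 7.1, "var2_high": 0.9}

per_variable = {}
for dummy, g in gain.items():
    original = dummy.rsplit("_", 1)[0]   # strip the category suffix
    per_variable[original] = per_variable.get(original, 0.0) + g

print(per_variable)  # {'var1': 15.5, 'var2': 8.0}
```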
Should SHAP value analysis be done on the training set or the test set? What does it mean if the feature importance based on mean |SHAP value| differs between the training and test sets of my lightgbm model? I intend to use SHAP analysis to identify how each feature contributes to each individual prediction, and possibly to identify individual predictions that are anomalous: for instance, if an individual prediction's top (+/-) contributing features are vastly different from those of the model's feature …
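A minimal sketch of the comparison in question, on synthetic data with a LightGBM regressor standing in for the actual model:

```python
import numpy as np
import shap
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model.
X, y = make_regression(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LGBMRegressor(n_estimators=200).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
mean_abs_train = np.abs(explainer.shap_values(X_train)).mean(axis=0)
mean_abs_test = np.abs(explainer.shap_values(X_test)).mean(axis=0)

print(np.argsort(-mean_abs_train))  # feature ranking on the training set
print(np.argsort(-mean_abs_test))   # feature ranking on the test set
```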
I'm currently working with a dataset that has been collected over several years, and I suspect the predictive power of my predictor variables is changing over time. I could go back year by year and run the data the same way each time to see how effective each predictor is, then trend the predictive power over time manually, but there has to be a better way. Can anyone point me towards the technique I should read up on?
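One possible automation of the year-by-year idea (a sketch; the column names and model choice are placeholders): refit the same model per year and record permutation importance, giving a table whose trend can then be inspected.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def importance_by_year(df, feature_cols, target_col, year_col):
    """Refit the same model on each year's data and record permutation importance."""
    rows = []
    for year, chunk in df.groupby(year_col):
        X, y = chunk[feature_cols], chunk[target_col]
        model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
        imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
        rows.append(pd.Series(imp.importances_mean, index=feature_cols, name=year))
    return pd.DataFrame(rows)  # one row per year, one column per predictor
```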
Is it possible to use feature importance from random forests (e.g. based on Gini impurity) or other models to determine which features I can use to group the rows of my dataset homogeneously? For example, let's say I have a dataset with N rows and p columns (one of the columns is used as the label in my training task). I train the model and get a ranking of the importance of my features. Only 5 features are more important …
In classification, when we want to get the importance of each variable in the random forest algorithm, we usually use the Mean Decrease in Gini or Mean Decrease in Accuracy metrics. Is there a metric that computes the positive or negative effect of each variable, not on the predictive accuracy of the model, but on the dependent variable itself? Something like the beta coefficients in a standard linear regression model, but in the context of classification with random forests.
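One commonly suggested complement (a sketch on synthetic data, not the only option): partial dependence shows whether increasing a variable pushes the predicted probability up or down, which is the closest random forest analogue to the sign of a beta coefficient.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

pdp = partial_dependence(rf, X, features=[0])  # average effect of feature 0 on P(class 1)
print(pdp["average"][0])                       # an increasing curve => positive effect
```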
I want to determine predictor importance. The ideal would be to re-train the same model on the same dataset with each variable removed in turn, but this is too time consuming. The recommendation I have seen everywhere is to "remove" a column by converting it into noise, i.e. replacing it with a permutation of itself. Why is it not better to replace the variable with a constant, thus "muting" the signal? I ran an experiment on my own natural dataset, with highly cross-correlated variables removed. Variable importance was …
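For illustration, a small sketch of the experiment described, comparing the importance obtained by permuting a column with that obtained by replacing it with its mean, on a toy dataset (not the asker's data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
baseline = r2_score(y, model.predict(X))

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # "remove" the column by permuting it
    X_const = X.copy()
    X_const[:, j] = X_const[:, j].mean()           # "mute" the column with a constant
    print(j,
          baseline - r2_score(y, model.predict(X_perm)),
          baseline - r2_score(y, model.predict(X_const)))
```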
I am working on a problem where I need to classify phrases into one of two categories (let's call them A and B). I used the Keras SepCNN model (similar to this) for that, and it is giving me some results. Now I want to analyse the predictions, and more specifically I want to know why the model classified a certain phrase into category A or B, i.e. which set of features played an important role in labeling that phrase as category …
I am currently working on a research project where the central question is which features drive the predictions of different models. The main issue is that there is high (multi-)collinearity among those features. Imagine a setting with about 200 different features that are all potential candidates for helping to predict the same dependent variable. In the past, the relevance of small subsets of, say, 5 of these features has simply been analyzed by throwing them into a linear regression model and …