I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we still might. We have around $1000$ binary variables and around $200$k samples. The target variable $y$ is categorical. What I'd like to do is find subsets of variables that precisely predict some $y_k$. …
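A minimal sketch of one way to read "subsets that precisely predict $y_k$": fit a shallow decision tree and treat each root-to-leaf path as a candidate conjunction of features. The data below is a random stand-in; all names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Small random stand-in for the 200k x 1000 binary matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 50))
y = rng.integers(0, 3, size=2000)

# A shallow tree: each root-to-leaf path is a conjunction of feature
# tests, i.e. a candidate subset of variables predicting a class.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=100).fit(X, y)
print(export_text(tree, feature_names=[f"f{j}" for j in range(X.shape[1])]))
```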
I'm having an issue: after running XGBoost inside a HalvingGridSearchCV, I receive a certain number of estimators (50, for example), but the number of trees is constantly being multiplied by 3. I don't understand why. Here is the code:

```python
model = XGBClassifier(objective='multi:softprob', subsample=0.9,
                      colsample_bytree=0.5, num_class=3)
md = [3, 6, 10, 15]
lr = [0.1, 0.5, 1]
g = [0, 0.25, 1]
rl = [0, 1, 10]
spw = [1, 3, 5]
ns = [5, 10, 20]
```
…
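For context, a multi-class booster with objective='multi:softprob' grows one tree per class per boosting round, so n_estimators × num_class trees in total is the expected count. A small self-contained check (toy data, names assumed):

```python
import numpy as np
from xgboost import XGBClassifier

# Toy 3-class data, just to count the trees in the fitted booster.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)

model = XGBClassifier(n_estimators=50, objective='multi:softprob').fit(X, y)
trees = model.get_booster().trees_to_dataframe()
print(trees['Tree'].nunique())  # 150 = 50 rounds x 3 classes
```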
I am working with a dataset that has enough observations and ~10 variables:

- half of the variables are numeric,
- the other half are categorical with 2-3 levels (demographics),
- one ID variable,
- one last variable that holds the sales value: 0 for no sale, the bill amount for a sale.

Using this information, I want to understand which segments of my customers to market. I am using R for the code, but that's not relevant here. :) I am confused about …
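One common framing, sketched here in Python since the question says the language is not relevant: binarize the sale column and fit a shallow tree, whose leaves are candidate segments. The frame and column names below are made-up stand-ins.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical frame: demographics plus a 'sales' amount (0 = no sale).
df = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29],
    "region": ["N", "S", "N", "S", "N", "S"],
    "sales": [0, 120, 0, 250, 80, 0],
})
X = pd.get_dummies(df[["age", "region"]])  # one-hot the categoricals
y = (df["sales"] > 0).astype(int)          # segment on sale vs. no sale

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # leaves = segments
```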
For a data science project, I first applied a standard scaler to the data in Python, ran a random forest, then plotted a tree. However, the decision thresholds are shown in their standardized form. How do I plot them on the original scale? Example:
as is: decision node based on Age <= 2.04
desired: decision node based on Age <= 30
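Since trees don't need feature scaling, the simplest fix is to fit on the unscaled data; otherwise the thresholds can be mapped back, because for a StandardScaler x_orig = x_std * scale_ + mean_. A sketch on toy data (column meanings assumed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Toy data with an Age-like and an income-like column.
rng = np.random.default_rng(0)
X = rng.normal(loc=[30, 50000], scale=[10, 15000], size=(500, 2))
y = (X[:, 0] > 30).astype(int)

scaler = StandardScaler().fit(X)
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(scaler.transform(X), y)

# Undo the scaling on each split threshold of the first tree.
t = rf.estimators_[0].tree_
for node, feat in enumerate(t.feature):
    if feat >= 0:  # negative values mark leaves
        thr = t.threshold[node] * scaler.scale_[feat] + scaler.mean_[feat]
        print(f"node {node}: X[{feat}] <= {thr:.2f}")
```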
I am currently reading this paper on isolation forests. In the section about the score function, they mention the following. For context, $h(x)$ is defined as the path length of a data point traversing an iTree, and $n$ is the sample size used to grow the iTree. The difficulty in deriving such a score from $h(x)$ is that while the maximum possible height of an iTree grows in the order of $n$, the average height grows in the order of $\log(n)$. …
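For reference, the paper resolves this by normalizing $h(x)$ with the average path length of an unsuccessful search in a binary search tree built on $n$ points:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649,$$

which yields the anomaly score

$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}.$$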
I am stuck on how to find the variables most associated with the target based on group means. I have this dataset, and when I try df.groupby('left').mean() it gives a table of per-group means as output. One of my friends said that, from that output, the variables most associated with the attribute left would be: 1. Satisfaction Level, 2. Average Monthly Hours, 3. Promotion Last 5 Years. I am wondering: how could someone infer that?
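A way to make the friend's reading concrete: rank columns by how far the two group means sit apart relative to each column's spread. Toy frame with assumed column names:

```python
import pandas as pd

# Hypothetical HR-style data; column names are assumptions.
df = pd.DataFrame({
    "left": [0, 0, 1, 1, 0, 1],
    "satisfaction_level": [0.8, 0.7, 0.3, 0.2, 0.9, 0.4],
    "average_monthly_hours": [160, 150, 240, 260, 155, 250],
    "promotion_last_5years": [1, 0, 0, 0, 1, 0],
})

# Columns whose means differ most between left=0 and left=1
# (scaled by their spread) are the ones that stand out.
means = df.groupby("left").mean()
effect = (means.loc[1] - means.loc[0]).abs() / df.drop(columns="left").std()
print(effect.sort_values(ascending=False))
```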
In the original publication of the Isolation Forest algorithm, the authors mention a height-limit parameter to control the granularity of the algorithm. I did not find that explicit parameter in the scikit-learn implementation of the algorithm, and I was wondering whether it is possible to control granularity in some other way?
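One lever worth knowing about: scikit-learn derives the height limit from the sub-sample size, roughly ceil(log2(max_samples)), so max_samples is the knob that plays this role. A quick check:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

# Smaller sub-samples -> shallower trees -> coarser granularity.
iso = IsolationForest(max_samples=64, random_state=0).fit(X)
print(iso.estimators_[0].get_depth())  # bounded by ceil(log2(64)) = 6
```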
In RapidMiner, are the decision tree's weights a measure of the "importance" of attributes in the splitting procedure? If so, why is it useful to know these weights? Are there better methods for finding the most discriminant features in a data set?
I am trying to build a regression tree with 70 attributes, where the business team wants to fix the first two levels, namely country and product type. To achieve this, I have two proposals. The first: build a separate tree for each combination of country and product type, subset the data accordingly, and pass each case to the respective tree for prediction. Seen here in comments. I have 88 levels in country and 3 levels in product type, so it will …
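A sketch of that first proposal, with a made-up frame and column names: group by the two fixed attributes and keep a dictionary of per-cell trees, which is equivalent to hard-coding the top two levels of one big tree.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data; the real frame would hold the remaining 68 attributes.
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "product_type": ["A", "A", "B", "B"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 12.0, 20.0, 24.0],
})

# One tree per (country, product_type) cell.
models = {
    key: DecisionTreeRegressor().fit(g[["x1"]], g["y"])
    for key, g in df.groupby(["country", "product_type"])
}
print(models[("US", "A")].predict(pd.DataFrame({"x1": [1.5]})))
```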
Problem: from what I understand, a common method in anomaly detection consists of building a predictive model trained on non-anomalous training data and performing anomaly detection using the model's error when predicting on the observed data. This method requires the user to identify non-anomalous data beforehand. What if it's not possible to label non-anomalous data to train the model? Is there anything in the literature that explains how to overcome this issue? I have an idea, but I was …
The dataset that I have is some text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: there are a few big classes with more than 500 samples and some minor classes with fewer than 100 samples; some are even smaller (fewer than 20). This is real collected data, so the chance of the model seeing a minor class in the actual implementation …
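One standard lever for this setup, sketched with stand-in data: character n-gram TF-IDF (often a good fit for path strings) plus class_weight='balanced', which reweights the impurity computation by inverse class frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy path names and labels, stand-ins for the real data.
paths = ["/usr/bin/python", "/home/user/doc.txt", "/usr/bin/gcc", "/tmp/cache.bin"]
labels = ["system", "user", "system", "temp"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
).fit(paths, labels)
print(clf.predict(["/usr/bin/ls"]))
```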
```python
from scipy.sparse import hstack

X_tr1 = hstack((X_train_cc_ohe, X_train_csc_ohe, X_train_grade_ohe,
                X_train_price_norm, X_train_tnppp_norm,
                X_train_essay_bow, X_train_pt_bow)).tocsr()
X_te1 = hstack((X_test_cc_ohe, X_test_csc_ohe, X_test_grade_ohe,
                X_test_price_norm, X_test_tnppp_norm,
                X_test_essay_bow, X_test_pt_bow)).tocsr()
```

X_train_cc_ohe and the others are vectorized categorical data, and X_train_pt_bow is bag-of-words vectorized text data. Now, I applied a decision tree classifier on this model and got this: I took max_depth as 3 just for visualization purposes. My question is: I would like to get feature names in my output instead of indices such as X2599, X4, etc. …
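The key point is that the name list has to be concatenated in the same order as the blocks passed to hstack; each fitted encoder/vectorizer can report its own names via get_feature_names_out() (scikit-learn ≥ 1.0). A self-contained miniature of the idea, with toy stand-ins for the real encoders:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-ins for the real one-hot and bag-of-words blocks.
cats = np.array([["a"], ["b"], ["a"], ["c"]])
texts = ["good project", "bad idea", "good idea", "bad project"]

ohe = OneHotEncoder().fit(cats)
bow = CountVectorizer().fit(texts)
X = hstack((ohe.transform(cats), bow.transform(texts))).tocsr()

# Concatenate names in the SAME order as the hstack blocks.
feature_names = list(ohe.get_feature_names_out()) + list(bow.get_feature_names_out())

clf = DecisionTreeClassifier(max_depth=3).fit(X, [1, 0, 1, 0])
print(export_text(clf, feature_names=feature_names))
```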
I understand how a decision tree is constructed (in the ID3 algorithm) using criteria such as entropy, Gini index, and variance reduction. But the formulae for these criteria do not care about optimization metrics such as accuracy, recall, AUC, kappa, F1-score, and others. R and Python packages allow me to optimize for such metrics when I construct a decision tree. What do they do differently for each of these metrics? Where does the change happen? Is there a pattern to …
```r
library(tibble)
library(rpart)

df <- tibble(x1 = factor(c("S1", "S1", "S2", "S2")),
             y  = factor(c(1, 1, 0, 1)))
md <- rpart(formula = y ~ ., data = df, method = "class",
            control = rpart.control(minsplit = 2, cp = 0))
nrow(md$frame)  # outputs 1, i.e. a root-only tree
```

Consider the split with left child node ("S1", 1), ("S1", 1) and right child node ("S2", 0), ("S2", 1). Here the gain in Gini would be ${1 \over 8} = 0.125$. Why is rpart not doing this split?
When I use the MATLAB command fitctree for classification and change the order of the attributes, I do not get the same tree, and thus not the same classification error. Why? Does the CART algorithm take into account the order in which the attributes are introduced?
I came across an adapted question from the famous ISLR book and realise I am unsure of the answer. Does anyone know? Interested in the intuition here! Cost-complexity pruned trees with $\alpha=1$ relative to unpruned trees (select one): a. Will have better performance due to increased flexibility when its increase in bias is less than its decrease in variance. b. Will have better performance due to increased flexibility when its increase in variance is less than its decrease in …
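For the intuition, recall the criterion from ISLR: cost-complexity pruning picks the subtree $T \subseteq T_0$ minimizing

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha\,|T|,$$

where $|T|$ is the number of terminal nodes. A positive $\alpha$ penalizes leaves, so a pruned tree is less flexible than the unpruned one, not more, which is why the "increased flexibility" wording in the options deserves scrutiny.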
Let's say I have data with two target labels, A and B. I want to design a random forest that has three outputs: A, B and Not sure. Items in the Not sure category would be a mix of A and B that would be about evenly distributed. I don't mind writing the RF from scratch. Two questions: What should my split criterion be? Can this problem be reposed in a standard RF framework?
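On the second question: one way to repose it in a standard RF framework is thresholding predict_proba, abstaining when the vote is near even. The 0.4/0.6 band below is an assumption to tune.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy binary data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
proba = rf.predict_proba(X)[:, 1]

# Abstain when the forest's vote is close to a coin flip.
pred = np.where(proba > 0.6, "A", np.where(proba < 0.4, "B", "Not sure"))
print(dict(zip(*np.unique(pred, return_counts=True))))
```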
I'm trying to understand the difference between random forests and extremely randomized trees (https://orbi.uliege.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf). I understand that Extra-Trees uses random splits and no bootstrapping, as covered here: https://stackoverflow.com/questions/22409855/randomforestclassifier-vs-extratreesclassifier-in-scikit-learn The question I'm struggling with is: if all the splits are randomized, how does an extremely randomized decision tree learn anything about the objective function? Where is the 'optimization' step?
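A toy sketch of the split rule from the Geurts et al. paper may help locate it: the cut-points are drawn at random, but among the $K$ random candidates the one with the best impurity score is kept, and that selection is the optimization step.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def extra_tree_split(X, y, k):
    """Draw K (feature, uniform-random threshold) candidates, keep the best."""
    best = None
    for feat in rng.choice(X.shape[1], size=k, replace=False):
        thr = rng.uniform(X[:, feat].min(), X[:, feat].max())  # not searched
        mask = X[:, feat] <= thr
        if mask.all() or not mask.any():
            continue
        # Weighted child impurity: lower is better.
        score = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
        if best is None or score < best[0]:
            best = (score, feat, thr)  # ...the best random candidate wins
    return best

X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)
print(extra_tree_split(X, y, k=3))
```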
I have been implementing a DecisionTreeRegressor model in an Anaconda environment with a data set sourced from a 20-million-row, 12-dimensional CSV file. I could read the data set in chunks, with chunksize set to 500,000 rows, and compute the R-squared score on the training/test split data sets in each iteration of 500,000 rows up to iteration #20.

sklearn.__version__: 0.19.0
pandas.__version__: 0.20.3
numpy.__version__: 1.13.1

The GridSearchCV() instance uses a parameter grid with the parameter max_depth set to values …
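For concreteness, a minimal sketch of that chunked loop; the file name and target column are assumptions.

```python
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 'data.csv' with a 'target' column among the 12 dimensions.
for i, chunk in enumerate(pd.read_csv("data.csv", chunksize=500_000), start=1):
    X = chunk.drop(columns="target")
    y = chunk["target"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = DecisionTreeRegressor(max_depth=6).fit(X_tr, y_tr)
    print(f"chunk {i}: R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```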
Given a database, I split the data into train and test sets. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Considering I already found the best parameters for my model, if I run the model on the test set I obtain a different result at each run (with the same hyper-parameters). Why is that? Considering I am using accuracy as the metric, I get variations from 0.5 to 0.8. Which result should I take as correct, …
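Worth noting that sklearn's tree splitter randomly permutes features, so the best found split, and hence the score, can vary across runs unless the seed is pinned. A sketch with made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Pin random_state on both the split and the tree for reproducible scores.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # identical on every run
```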