I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we still might. We have around $1000$ binary variables and around $200$k samples. The target variable $y$ is categorical. What I'd like to do is find subsets of variables that precisely predict some $y_k$. …
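A minimal sketch of one way to read "subsets that precisely predict $y_k$": fit a shallow decision tree and treat each root-to-leaf path as a candidate conjunction of features. The data below is a random stand-in; all names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Small random stand-in for the 200k x 1000 binary matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 50))
y = rng.integers(0, 3, size=2000)

# A shallow tree: each root-to-leaf path is a conjunction of feature
# tests, i.e. a candidate subset of variables predicting a class.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=100).fit(X, y)
print(export_text(tree, feature_names=[f"f{j}" for j in range(X.shape[1])]))
```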
I'm having an issue: after running XGBoost inside a HalvingGridSearchCV, I receive a certain number of estimators (50, for example), but the number of trees is constantly being multiplied by 3. I don't understand why. Here is the code:

```python
model = XGBClassifier(objective='multi:softprob', subsample=0.9,
                      colsample_bytree=0.5, num_class=3)
md = [3, 6, 10, 15]
lr = [0.1, 0.5, 1]
g = [0, 0.25, 1]
rl = [0, 1, 10]
spw = [1, 3, 5]
ns = [5, 10, 20]
```
…
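For context, a multi-class booster with objective='multi:softprob' grows one tree per class per boosting round, so n_estimators × num_class trees in total is the expected count. A small self-contained check (toy data, names assumed):

```python
import numpy as np
from xgboost import XGBClassifier

# Toy 3-class data, just to count the trees in the fitted booster.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)

model = XGBClassifier(n_estimators=50, objective='multi:softprob').fit(X, y)
trees = model.get_booster().trees_to_dataframe()
print(trees['Tree'].nunique())  # 150 = 50 rounds x 3 classes
```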
I am working with a dataset that has enough observations and ~10 variables:

- half of the variables are numeric,
- the other half are categorical with 2-3 levels (demographics),
- one ID variable,
- one last variable that holds the sales value: 0 for no sale, the bill amount for a sale.

Using this information, I want to understand which segments of my customers to market. I am using R for the code, but that's not relevant here. :) I am confused about …
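One common framing, sketched here in Python since the question says the language is not relevant: binarize the sale column and fit a shallow tree, whose leaves are candidate segments. The frame and column names below are made-up stand-ins.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical frame: demographics plus a 'sales' amount (0 = no sale).
df = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29],
    "region": ["N", "S", "N", "S", "N", "S"],
    "sales": [0, 120, 0, 250, 80, 0],
})
X = pd.get_dummies(df[["age", "region"]])  # one-hot the categoricals
y = (df["sales"] > 0).astype(int)          # segment on sale vs. no sale

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # leaves = segments
```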
For a data science project, I first applied a standard scaler to the data in Python, ran a random forest, then plotted a tree. However, the decision thresholds are shown in their standardized form. How do I plot them on the original scale? Example:
as is: decision node based on Age <= 2.04
desired: decision node based on Age <= 30
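Since trees don't need feature scaling, the simplest fix is to fit on the unscaled data; otherwise the thresholds can be mapped back, because for a StandardScaler x_orig = x_std * scale_ + mean_. A sketch on toy data (column meanings assumed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Toy data with an Age-like and an income-like column.
rng = np.random.default_rng(0)
X = rng.normal(loc=[30, 50000], scale=[10, 15000], size=(500, 2))
y = (X[:, 0] > 30).astype(int)

scaler = StandardScaler().fit(X)
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(scaler.transform(X), y)

# Undo the scaling on each split threshold of the first tree.
t = rf.estimators_[0].tree_
for node, feat in enumerate(t.feature):
    if feat >= 0:  # negative values mark leaves
        thr = t.threshold[node] * scaler.scale_[feat] + scaler.mean_[feat]
        print(f"node {node}: X[{feat}] <= {thr:.2f}")
```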
I am currently reading this paper on isolation forests. In the section about the score function, they mention the following. For context, $h(x)$ is defined as the path length of a data point traversing an iTree, and $n$ is the sample size used to grow the iTree. The difficulty in deriving such a score from $h(x)$ is that while the maximum possible height of an iTree grows in the order of $n$, the average height grows in the order of $\log(n)$. …
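For reference, the paper resolves this by normalizing $h(x)$ with the average path length of an unsuccessful search in a binary search tree built on $n$ points:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649,$$

which yields the anomaly score

$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}.$$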
I am stuck on how to find the variables most associated with the target based on group means. I have this dataset, and when I try df.groupby('left').mean() it gives a table of per-group means as output. One of my friends said that, from that output, the variables most associated with the attribute left would be: 1. Satisfaction Level, 2. Average Monthly Hours, 3. Promotion Last 5 Years. I am wondering: how could someone infer that?
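A way to make the friend's reading concrete: rank columns by how far the two group means sit apart relative to each column's spread. Toy frame with assumed column names:

```python
import pandas as pd

# Hypothetical HR-style data; column names are assumptions.
df = pd.DataFrame({
    "left": [0, 0, 1, 1, 0, 1],
    "satisfaction_level": [0.8, 0.7, 0.3, 0.2, 0.9, 0.4],
    "average_monthly_hours": [160, 150, 240, 260, 155, 250],
    "promotion_last_5years": [1, 0, 0, 0, 1, 0],
})

# Columns whose means differ most between left=0 and left=1
# (scaled by their spread) are the ones that stand out.
means = df.groupby("left").mean()
effect = (means.loc[1] - means.loc[0]).abs() / df.drop(columns="left").std()
print(effect.sort_values(ascending=False))
```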
In the original publication of the Isolation Forest algorithm, the authors mention a height-limit parameter to control the granularity of the algorithm. I did not find that explicit parameter in the scikit-learn implementation of the algorithm, and I was wondering whether it is possible to control granularity in some other way?
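One lever worth knowing about: scikit-learn derives the height limit from the sub-sample size, roughly ceil(log2(max_samples)), so max_samples is the knob that plays this role. A quick check:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

# Smaller sub-samples -> shallower trees -> coarser granularity.
iso = IsolationForest(max_samples=64, random_state=0).fit(X)
print(iso.estimators_[0].get_depth())  # bounded by ceil(log2(64)) = 6
```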
In RapidMiner, are the decision tree's weights a measure of the "importance" of attributes in the splitting procedure? If so, why is it useful to know these weights? Are there better methods for finding the most discriminant features in a data set?
I am trying to build a regression tree with 70 attributes, where the business team wants to fix the first two levels, namely country and product type. To achieve this, I have two proposals. The first: build a separate tree for each combination of country and product type, subset the data accordingly, and pass each case to the respective tree for prediction. Seen here in comments. I have 88 levels in country and 3 levels in product type, so it will …
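A sketch of that first proposal, with a made-up frame and column names: group by the two fixed attributes and keep a dictionary of per-cell trees, which is equivalent to hard-coding the top two levels of one big tree.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data; the real frame would hold the remaining 68 attributes.
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "product_type": ["A", "A", "B", "B"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 12.0, 20.0, 24.0],
})

# One tree per (country, product_type) cell.
models = {
    key: DecisionTreeRegressor().fit(g[["x1"]], g["y"])
    for key, g in df.groupby(["country", "product_type"])
}
print(models[("US", "A")].predict(pd.DataFrame({"x1": [1.5]})))
```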
Problem: from what I understand, a common method in anomaly detection consists of building a predictive model trained on non-anomalous training data and performing anomaly detection using the model's error when predicting on the observed data. This method requires the user to identify non-anomalous data beforehand. What if it's not possible to label non-anomalous data to train the model? Is there anything in the literature that explains how to overcome this issue? I have an idea, but I was …
The dataset that I have is some text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: there are a few big classes with more than 500 samples and some minor classes with fewer than 100 samples; some are even smaller (fewer than 20). This is real collected data, so the chance of the model seeing a minor class in the actual implementation …
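One standard lever for this setup, sketched with stand-in data: character n-gram TF-IDF (often a good fit for path strings) plus class_weight='balanced', which reweights the impurity computation by inverse class frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy path names and labels, stand-ins for the real data.
paths = ["/usr/bin/python", "/home/user/doc.txt", "/usr/bin/gcc", "/tmp/cache.bin"]
labels = ["system", "user", "system", "temp"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
).fit(paths, labels)
print(clf.predict(["/usr/bin/ls"]))
```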
```python
from scipy.sparse import hstack

X_tr1 = hstack((X_train_cc_ohe, X_train_csc_ohe, X_train_grade_ohe,
                X_train_price_norm, X_train_tnppp_norm,
                X_train_essay_bow, X_train_pt_bow)).tocsr()
X_te1 = hstack((X_test_cc_ohe, X_test_csc_ohe, X_test_grade_ohe,
                X_test_price_norm, X_test_tnppp_norm,
                X_test_essay_bow, X_test_pt_bow)).tocsr()
```

X_train_cc_ohe and the others are vectorized categorical data, and X_train_pt_bow is bag-of-words vectorized text data. Now, I applied a decision tree classifier on this model and got this: I took max_depth as 3 just for visualization purposes. My question is: I would like to get feature names in my output instead of indices such as X2599, X4, etc. …
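The key point is that the name list has to be concatenated in the same order as the blocks passed to hstack; each fitted encoder/vectorizer can report its own names via get_feature_names_out() (scikit-learn ≥ 1.0). A self-contained miniature of the idea, with toy stand-ins for the real encoders:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-ins for the real one-hot and bag-of-words blocks.
cats = np.array([["a"], ["b"], ["a"], ["c"]])
texts = ["good project", "bad idea", "good idea", "bad project"]

ohe = OneHotEncoder().fit(cats)
bow = CountVectorizer().fit(texts)
X = hstack((ohe.transform(cats), bow.transform(texts))).tocsr()

# Concatenate names in the SAME order as the hstack blocks.
feature_names = list(ohe.get_feature_names_out()) + list(bow.get_feature_names_out())

clf = DecisionTreeClassifier(max_depth=3).fit(X, [1, 0, 1, 0])
print(export_text(clf, feature_names=feature_names))
```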
I understand how a decision tree is constructed (in the ID3 algorithm) using criteria such as entropy, Gini index, and variance reduction. But the formulae for these criteria do not care about optimization metrics such as accuracy, recall, AUC, kappa, F1-score, and others. R and Python packages allow me to optimize for such metrics when I construct a decision tree. What do they do differently for each of these metrics? Where does the change happen? Is there a pattern to …
```r
library(tibble)
library(rpart)

df <- tibble(x1 = factor(c("S1", "S1", "S2", "S2")),
             y  = factor(c(1, 1, 0, 1)))
md <- rpart(formula = y ~ ., data = df, method = "class",
            control = rpart.control(minsplit = 2, cp = 0))
nrow(md$frame)  # outputs 1, i.e. a root-only tree
```

Consider the split with left child node ("S1", 1), ("S1", 1) and right child node ("S2", 0), ("S2", 1). Here the gain in Gini would be ${1 \over 8} = 0.125$. Why is rpart not doing this split?
When I use the MATLAB command fitctree for classification and change the order of the attributes, I do not get the same tree, and thus not the same classification error. Why? Does the CART algorithm take into account the order in which the attributes are introduced?
I came across an adapted question from the famous ISLR book and realise I am unsure of the answer. Does anyone know? Interested in the intuition here! Cost-complexity pruned trees with $\alpha=1$ relative to unpruned trees (select one): a. Will have better performance due to increased flexibility when its increase in bias is less than its decrease in variance. b. Will have better performance due to increased flexibility when its increase in variance is less than its decrease in …
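For the intuition, recall the criterion from ISLR: cost-complexity pruning picks the subtree $T \subseteq T_0$ minimizing

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha\,|T|,$$

where $|T|$ is the number of terminal nodes. A positive $\alpha$ penalizes leaves, so a pruned tree is less flexible than the unpruned one, not more, which is why the "increased flexibility" wording in the options deserves scrutiny.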
Let's say I have data with two target labels, A and B. I want to design a random forest that has three outputs: A, B and Not sure. Items in the Not sure category would be a mix of A and B that would be about evenly distributed. I don't mind writing the RF from scratch. Two questions: What should my split criterion be? Can this problem be reposed in a standard RF framework?
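On the second question: one way to repose it in a standard RF framework is thresholding predict_proba, abstaining when the vote is near even. The 0.4/0.6 band below is an assumption to tune.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy binary data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
proba = rf.predict_proba(X)[:, 1]

# Abstain when the forest's vote is close to a coin flip.
pred = np.where(proba > 0.6, "A", np.where(proba < 0.4, "B", "Not sure"))
print(dict(zip(*np.unique(pred, return_counts=True))))
```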
I'm trying to understand the difference between random forests and extremely randomized trees (https://orbi.uliege.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf). I understand that Extra-Trees uses random splits and no bootstrapping, as covered here: https://stackoverflow.com/questions/22409855/randomforestclassifier-vs-extratreesclassifier-in-scikit-learn The question I'm struggling with is: if all the splits are randomized, how does an extremely randomized decision tree learn anything about the objective function? Where is the 'optimization' step?
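A toy sketch of the split rule from the Geurts et al. paper may help locate it: the cut-points are drawn at random, but among the $K$ random candidates the one with the best impurity score is kept, and that selection is the optimization step.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def extra_tree_split(X, y, k):
    """Draw K (feature, uniform-random threshold) candidates, keep the best."""
    best = None
    for feat in rng.choice(X.shape[1], size=k, replace=False):
        thr = rng.uniform(X[:, feat].min(), X[:, feat].max())  # not searched
        mask = X[:, feat] <= thr
        if mask.all() or not mask.any():
            continue
        # Weighted child impurity: lower is better.
        score = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
        if best is None or score < best[0]:
            best = (score, feat, thr)  # ...the best random candidate wins
    return best

X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)
print(extra_tree_split(X, y, k=3))
```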
I have been implementing a DecisionTreeRegressor model in an Anaconda environment with a data set sourced from a 20-million-row, 12-dimensional CSV file. I could read the data set in chunks, with chunksize set to 500,000 rows, and compute the R-squared score on the training/test split data sets in each iteration of 500,000 rows up to iteration #20.

sklearn.__version__: 0.19.0
pandas.__version__: 0.20.3
numpy.__version__: 1.13.1

The GridSearchCV() instance uses a parameter grid with the parameter max_depth set to values …
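For concreteness, a minimal sketch of that chunked loop; the file name and target column are assumptions.

```python
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 'data.csv' with a 'target' column among the 12 dimensions.
for i, chunk in enumerate(pd.read_csv("data.csv", chunksize=500_000), start=1):
    X = chunk.drop(columns="target")
    y = chunk["target"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = DecisionTreeRegressor(max_depth=6).fit(X_tr, y_tr)
    print(f"chunk {i}: R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```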
Given a database, I split the data into train and test sets. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Considering I already found the best parameters for my model, if I run the model on the test set I obtain a different result at each run (with the same hyper-parameters). Why is that? Considering I am using accuracy as the metric, I get variations from 0.5 to 0.8. Which result should I take as correct, …
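Worth noting that sklearn's tree splitter randomly permutes features, so the best found split, and hence the score, can vary across runs unless the seed is pinned. A sketch with made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Pin random_state on both the split and the tree for reproducible scores.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # identical on every run
```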