How to generate a rule-based system from binary data?

I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we still might. We have around $1000$ binary variables and around $200k$ samples. The target variable $y$ is categorical. What I'd like to do is find subsets of variables that precisely predict some $y_k$. …
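One possible starting point (a minimal sketch, not necessarily what the asker has in mind): fit a shallow decision tree on the binary matrix and read each root-to-leaf path as a candidate rule. The data here is randomly generated and the feature names are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-ins for the 200k x 1000 binary matrix X and categorical target y
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))
y = rng.integers(0, 3, size=1000)

# A shallow tree with a large min_samples_leaf yields short, readable conjunctions
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0).fit(X, y)

# Each root-to-leaf path reads as "feature_a = 1 AND feature_b = 0 -> class k"
print(export_text(tree, feature_names=[f"feature_{j}" for j in range(X.shape[1])]))
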
Category: Data Science

GridSearch multiplying the number of trees in XGBoost?

I'm having an issue: after running XGBoost inside a HalvingGridSearchCV, I get back a certain number of estimators (50, for example), but the number of trees is consistently being multiplied by 3, and I don't understand why. Here is the code: model = XGBClassifier(objective='multi:softprob', subsample=0.9, colsample_bytree=0.5, num_class=3) md = [3, 6, 10, 15] lr = [0.1, 0.5, 1] g = [0, 0.25, 1] rl = [0, 1, 10] spw = [1, 3, 5] ns = [5, 10, 20] …
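For context, a minimal sketch of how such a search is usually wired up (the grid values mirror the lists above; the parameter-name mapping and the cv setting are assumptions). Note that with objective='multi:softprob', XGBoost grows one tree per class per boosting round, so a booster with n_estimators=50 and num_class=3 contains 150 trees in its dump.

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from xgboost import XGBClassifier

model = XGBClassifier(objective='multi:softprob', subsample=0.9,
                      colsample_bytree=0.5, num_class=3)

# Grid built from the lists in the question; names follow xgboost's parameter names
param_grid = {
    'max_depth': [3, 6, 10, 15],
    'learning_rate': [0.1, 0.5, 1],
    'gamma': [0, 0.25, 1],
    'reg_lambda': [0, 1, 10],
    'scale_pos_weight': [1, 3, 5],
    'n_estimators': [5, 10, 20],
}

search = HalvingGridSearchCV(model, param_grid, cv=3)  # cv value is an assumption
# search.fit(X, y)
# With multi:softprob, each boosting round adds num_class trees, which is why the
# total tree count looks multiplied by 3 relative to n_estimators.
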
Category: Data Science

How to decide who to market? Clustering or Decision Tree?

I am working with a dataset that has enough observations and roughly 10 variables: half of the variables are numeric, the other half are categorical with 2-3 levels (demographics), there is one ID variable, and one last variable that holds the sales value (0 for no sale, the bill amount for a sale). Using this information, I want to understand which segments of my customers to market to. I am using R for the code but that's not relevant here. :) I am confused about …
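One way to frame it (a sketch of one possible approach, shown in Python for consistency with the rest of this page; column and file names are hypothetical): turn the sales column into a binary bought/did-not-buy target, fit an interpretable classifier, and read the segments off the splits.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical columns: 'age', 'income' (numeric), 'gender', 'region' (categorical),
# 'customer_id' (ID), 'sales' (0 = no sale, otherwise the bill amount)
df = pd.read_csv("customers.csv")                     # hypothetical file
X = pd.get_dummies(df.drop(columns=["customer_id", "sales"]))
y = (df["sales"] > 0).astype(int)                     # binary "bought" target

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # leaves ~ marketable segments
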
Category: Data Science

Isolation Forest Score Function Theory

I am currently reading this paper on isolation forests. In the section about the score function, they mention the following. For context, $h(x)$ is defined as the path length of a data point traversing an iTree, and $n$ is the sample size used to grow the iTree. The difficulty in deriving such a score from $h(x)$ is that while the maximum possible height of an iTree grows in the order of $n$, the average height grows in the order of $\log(n)$. …
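For reference, the paper resolves this by normalising $h(x)$ with the average path length of an unsuccessful search in a binary search tree, $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(i) \approx \ln(i) + 0.5772156649$, and scoring $s(x, n) = 2^{-E(h(x))/c(n)}$. A small sketch of that computation:

import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    # Average path length of an unsuccessful BST search: the normalisation term from the paper
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    # s(x, n) = 2^(-E(h(x)) / c(n)); close to 1 => anomaly, well below 0.5 => normal
    return 2.0 ** (-mean_path_length / c(n))

# Example: a point with average path length 4 over iTrees grown on subsamples of 256 points
print(anomaly_score(4.0, 256))
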
Category: Data Science

How to find the dependent variables from a dataset?

I am stuck on how to get the most dependent variables based on the mean. I have this dataset, and when I try df.groupby('left').mean() it gives the output shown. One of my friends said that, from that output, the dependent variables for the attribute left would be 1. Satisfaction Level, 2. Average Monthly Hours, 3. Promotion Last 5 Years. I am wondering how someone could deduce that.
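A sketch of the kind of comparison that usually backs such a judgement (file and column handling are assumptions): compare the group means for left = 0 vs left = 1 and look at the relative differences; columns whose means differ a lot between the two groups are the ones a reader would flag.

import pandas as pd

df = pd.read_csv("HR_data.csv")                         # hypothetical file name
means = df.groupby("left").mean(numeric_only=True)      # one row of per-column means per class

# Relative difference between leavers (left=1) and stayers (left=0); large values
# suggest the column varies strongly with 'left'
rel_diff = (means.loc[1] - means.loc[0]).abs() / means.loc[0].abs()
print(rel_diff.sort_values(ascending=False))
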
Category: Data Science

Fix first two levels of decision tree?

I am trying to build a regression tree with 70 attributes, where the business team wants to fix the first two levels, namely country and product type. To achieve this, I have two proposals: build a separate tree for each combination of country and product type, using the corresponding subset of the data, and pass each observation to the respective tree for prediction (seen here in the comments). I have 88 levels in country and 3 levels in product type, so it will …
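A minimal sketch of the first proposal (column names, target name, and data layout are assumptions): keep a dictionary of trees keyed by (country, product_type) and dispatch each row on those two columns.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def fit_per_group(df, target="y"):
    """Fit one regression tree per (country, product_type) subset; 88 x 3 groups here."""
    trees = {}
    for key, group in df.groupby(["country", "product_type"]):
        X = group.drop(columns=["country", "product_type", target])
        trees[key] = DecisionTreeRegressor().fit(X, group[target])
    return trees

def predict_per_group(trees, df, target="y"):
    """Route each row to the tree of its (country, product_type) combination."""
    preds = pd.Series(index=df.index, dtype=float)
    for key, group in df.groupby(["country", "product_type"]):
        X = group.drop(columns=["country", "product_type", target], errors="ignore")
        preds.loc[group.index] = trees[key].predict(X)
    return preds
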
Category: Data Science

Decision trees for anomaly detection

Problem: From what I understand, a common method in anomaly detection consists of building a predictive model trained on non-anomalous training data, and performing anomaly detection using the error of the model when predicting on the observed data. This method requires the user to identify non-anomalous data beforehand. What if it's not possible to label non-anomalous data to train the model? Is there anything in the literature that explains how to overcome this issue? I have an idea, but I was …
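For comparison, the tree-based, fully unsupervised route is Isolation Forest, which needs no labelled normal data at all. A minimal sketch on toy data (the contamination value is an assumption):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),      # bulk of the data
               rng.uniform(-6, 6, size=(10, 2))])    # a few scattered points

# No labels are needed; points that are isolated in few splits get high anomaly scores
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)          # +1 = inlier, -1 = anomaly
scores = iso.score_samples(X)    # lower = more anomalous
print((labels == -1).sum(), "points flagged as anomalies")
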
Category: Data Science

Should I resample my dataset?

The dataset that I have is some text data consisting of path names. I am using a TF-IDF vectorizer and decision trees. The classes in my dataset are severely imbalanced: there are a few big classes with more than 500 samples and some other minor classes with fewer than 100 samples. Some are even smaller (fewer than 20). This is real collected data, so the chance of the model seeing a minor class in actual implementation …
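Two common, lighter-weight alternatives to resampling in this kind of setup (a sketch on toy strings, not a recommendation for this specific dataset): class weighting inside the tree, or oversampling only the training fold with imbalanced-learn.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

paths = ["/usr/bin/python", "/etc/passwd", "C:\\Windows\\system32\\cmd.exe"]  # toy examples
labels = ["linux", "linux", "windows"]

# Option 1: let the tree reweight splits by inverse class frequency instead of resampling
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams suit path strings
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
)
clf.fit(paths, labels)

# Option 2 (assumes the imbalanced-learn package): oversample minority classes in training only
# from imblearn.over_sampling import RandomOverSampler
# X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
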
Category: Data Science

Visualizing decision tree with feature names

from scipy.sparse import hstack
X_tr1 = hstack((X_train_cc_ohe, X_train_csc_ohe, X_train_grade_ohe, X_train_price_norm, X_train_tnppp_norm, X_train_essay_bow, X_train_pt_bow)).tocsr()
X_te1 = hstack((X_test_cc_ohe, X_test_csc_ohe, X_test_grade_ohe, X_test_price_norm, X_test_tnppp_norm, X_test_essay_bow, X_test_pt_bow)).tocsr()
X_train_cc_ohe and the others are vectorized categorical data, and X_train_pt_bow is bag-of-words vectorized text data. Now, I applied a decision tree classifier on this model and got this: I took max_depth as 3 just for visualization purposes. My question is: I would like to get feature names in my output instead of indices such as X2599, X4, etc. …
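A sketch of the usual fix (the encoder/vectorizer variable names below are assumptions; they stand in for whatever produced the stacked blocks): build one name list per block, concatenate them in the same order as the hstack call, and pass the result to the plotting or export function.

import numpy as np
from sklearn.tree import export_graphviz, plot_tree

# get_feature_names_out() exists on OneHotEncoder / CountVectorizer / TfidfVectorizer.
# The order of the blocks here must match the order used in hstack(...).
feature_names = list(np.concatenate([
    cc_ohe.get_feature_names_out(),
    csc_ohe.get_feature_names_out(),
    grade_ohe.get_feature_names_out(),
    ["price_norm"],                       # single numeric columns get literal names
    ["tnppp_norm"],
    essay_bow.get_feature_names_out(),
    pt_bow.get_feature_names_out(),
]))

plot_tree(clf, feature_names=feature_names, max_depth=3, filled=True)
# or: export_graphviz(clf, feature_names=feature_names, out_file="tree.dot")
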
Category: Data Science

How does the construction of a decision tree differ for different optimization metrics?

I understand how a decision tree is constructed (in the ID3 algorithm) using criteria such as entropy, Gini index, and variance reduction. But the formulae for these criteria do not involve optimization metrics such as accuracy, recall, AUC, kappa, F1-score, and others. R and Python packages allow me to optimize for such metrics when I construct a decision tree. What do they do differently for each of these metrics? Where does the change happen? Is there a pattern to …
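In scikit-learn, at least, the two knobs are separate: the split criterion shapes how each tree is grown, while metrics like F1 or AUC typically enter through model selection (picking hyperparameters or a decision threshold), not through the split formula. A sketch of that separation:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# The tree itself is always grown with an impurity criterion (gini/entropy);
# the "optimize for F1" part happens outside the tree, when choosing hyperparameters.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8, None]},
    scoring="f1",       # model-selection metric, not the split criterion
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
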
Category: Data Science

Why is rpart not splitting this data even when there is gain in gini?

df <- tibble(x1=factor(c("S1", "S1", "S2", "S2")), y=factor(c(1, 1, 0, 1)))
md <- rpart(formula=y~., data=df, method="class", control=rpart.control(minsplit=2, cp=0))
nrow(md$frame) # outputs 1
Consider the split with left child node ("S1", 1), ("S1", 1) and right child node ("S2", 0), ("S2", 1). Here the gain in Gini would be ${1 \over 8} = 0.125$. Why is rpart not doing this split?
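As a check on the quoted number: the parent node has Gini $1 - (3/4)^2 - (1/4)^2 = 3/8$, the proposed children have Gini $0$ (left) and $1/2$ (right) with weights $2/4$ each, so the weighted child impurity is $1/4$ and the gain is $3/8 - 1/4 = 1/8$, as stated.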
Category: Data Science

Do cost-complexity pruned trees perform better relative to unpruned trees?

I came across an adapted question from the famous ISLR book and realised I am unsure of the answer. Does anyone know? I'm interested in the intuition here! Cost-complexity pruned trees with $\alpha=1$ relative to unpruned trees (select one): a. Will have better performance due to increased flexibility when its increase in bias is less than its decrease in variance. b. Will have better performance due to increased flexibility when its increase in variance is less than its decrease in …
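For intuition, a larger cost-complexity penalty makes the tree smaller, i.e. less flexible, not more; in scikit-learn this is the ccp_alpha parameter, and a quick sketch shows the tree shrinking as the penalty grows (dataset and values are assumptions):

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# The pruning path lists the alphas at which subtrees collapse; a larger alpha penalises
# leaves more heavily, so the pruned tree has fewer leaves (lower variance, higher bias).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[[0, len(path.ccp_alphas) // 2, -1]]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"ccp_alpha={alpha:.2f}: {tree.get_n_leaves()} leaves, depth {tree.get_depth()}")
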
Category: Data Science

How do I design a random forest split with a "not sure" category?

Let's say I have data with two target labels, A and B. I want to design a random forest that has three outputs: A, B, and Not sure. Items in the Not sure category would be a mix of A and B, roughly evenly distributed. I don't mind writing the RF from scratch. Two questions: What should my split criterion be? Can this problem be reposed in a standard RF framework?
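On the second question, one common way to repose it in a standard RF framework (a sketch; the probability band is an assumption) is to train an ordinary two-class forest and map predictions whose class probability falls near 0.5 to "Not sure":

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, flip_y=0.2, random_state=0)  # noisy A/B labels

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
proba_B = rf.predict_proba(X)[:, 1]

# Map the ambiguous band around 0.5 to a third output; the 0.4-0.6 band is an arbitrary choice
labels = np.where(proba_B >= 0.6, "B", np.where(proba_B <= 0.4, "A", "Not sure"))
print(dict(zip(*np.unique(labels, return_counts=True))))
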
Category: Data Science

How does ExtraTrees (Extremely Randomized Trees) learn?

I'm trying to understand the difference between random forests and extremely randomized trees (https://orbi.uliege.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf). I understand that ExtraTrees uses random splits and no bootstrapping, as covered here: https://stackoverflow.com/questions/22409855/randomforestclassifier-vs-extratreesclassifier-in-scikit-learn The question I'm struggling with is: if all the splits are randomized, how does an extremely randomized decision tree learn anything about the objective function? Where is the 'optimization' step?
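A sketch of the node-splitting step as described in the Geurts et al. paper may help: the cut-points are drawn at random, but among the K randomly-thresholded candidate features the best-scoring one is still selected, so an optimization step remains (just over K random candidates rather than all possible thresholds).

import numpy as np

def gini_gain(parent, left, right):
    def gini(y):
        if len(y) == 0:
            return 0.0
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def extra_trees_split(X_node, y_node, K, rng):
    """One node split, Extra-Trees style: random thresholds, best-of-K selection."""
    candidates = []
    for feature in rng.choice(X_node.shape[1], size=K, replace=False):
        lo, hi = X_node[:, feature].min(), X_node[:, feature].max()
        threshold = rng.uniform(lo, hi)            # cut-point drawn at random, not optimised
        left = y_node[X_node[:, feature] < threshold]
        right = y_node[X_node[:, feature] >= threshold]
        candidates.append((gini_gain(y_node, left, right), feature, threshold))
    return max(candidates)                          # ...but the best-scoring candidate wins
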
Category: Data Science

How to avoid memory error with Pandas pd.read_csv method call with GridSearchCV usage for DecisionTreeRegressor model?

I have been implementing a DecisionTreeRegressor model in an Anaconda environment with a dataset sourced from a 20-million-row, 12-dimensional CSV file. I can read the dataset in chunks, with chunksize set to 500,000 rows, and compute the R-squared score on the training/test split in each 500,000-row iteration, up to iteration #20. sklearn.__version__: 0.19.0, pandas.__version__: 0.20.3, numpy.__version__: 1.13.1. The GridSearchCV() instance uses a parameter grid with the parameter max_depth set to values …
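A sketch of the chunked-reading pattern that usually keeps memory in check (file name and dtype assumptions below are hypothetical); shrinking dtypes at read time often matters as much as the chunking itself:

import pandas as pd

# Reading 20M rows in 500k-row chunks; float32 dtypes roughly halve the memory footprint
chunks = pd.read_csv(
    "big_file.csv",                      # hypothetical path
    chunksize=500_000,
    dtype="float32",                     # assumes all 12 columns are numeric
)

pieces = []
for i, chunk in enumerate(chunks):
    pieces.append(chunk)                 # or: fit/score per chunk instead of accumulating
    print(f"chunk {i}: {chunk.memory_usage(deep=True).sum() / 1e6:.1f} MB")

df = pd.concat(pieces, ignore_index=True)  # only do this if the downcast frame fits in RAM
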
Category: Data Science

Decision Trees change result at every run, how can I trust of my results?

Given a database, I split the data into train and test sets. I want to use a decision-tree classifier (sklearn) for a binary classification problem. Given that I have already found the best parameters for my model, when I run it on the test set I obtain a different result at each run (with the same hyperparameters). Why is that? Using accuracy as the metric, I see variations from 0.5 to 0.8. Which result should I take as correct, …
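A sketch of the usual way to pin this down (toy data; the key point is fixing random_state in both the split and the tree, since those are the two places randomness enters in this setup):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# Fixing random_state in the split and in the tree makes repeated runs reproducible;
# leaving either unset can change the trained tree (and hence accuracy) from run to run.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
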
Category: Data Science
