How to group multiple categories of a categorical variable before feeding the data to a machine learning algorithm?

I have a labelled dataset to which I wish to fit a classification model (say, a decision tree). One of the categorical variables (say STATE) in the data has a large number of categories (say 100 different states). One-hot encoding such a variable would create very sparse features, deteriorating the performance of the model. There are other encoding methods, of course, like binary encoding, but they introduce bias in non-trivial ways. Some articles suggest we group different …
Category: Data Science
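
A common remedy, sketched below under the assumption of a pandas DataFrame with a STATE column (the 1% frequency cutoff and the "Other" bucket name are arbitrary choices), is to lump rare categories together before one-hot encoding:

import pandas as pd

# Hypothetical data: a STATE column with many levels.
df = pd.DataFrame({"STATE": ["CA", "CA", "NY", "TX", "WY", "VT"]})

# Merge every category whose relative frequency falls below a cutoff
# into a single "Other" bucket, then one-hot encode the reduced set.
freq = df["STATE"].value_counts(normalize=True)
rare = freq[freq < 0.01].index
df["STATE"] = df["STATE"].where(~df["STATE"].isin(rare), "Other")
X = pd.get_dummies(df["STATE"], prefix="STATE")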

Simple CART model example

My goal is to test a decision tree as a regression model. My data looks like the Python DataFrame below: there are two features, F1 and F2, and a numeric label. How do I build a CART model from this using sklearn or TensorFlow? (I've searched for examples, but they look complex for a beginner like me.)

import pandas as pd
df = pd.DataFrame({'F1': ['a', 'a', 'b', 'b'],
                   'F2': ['a', 'b', 'a', 'b'],
                   'Label': [10, 20, 100, 200]})

F1 F2 Label
a  a   10
a  b   20
b  a  100
b  b  200
Category: Data Science
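
A minimal sketch with sklearn, assuming one-hot encoding of the string features is acceptable (sklearn trees need numeric inputs):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({'F1': ['a', 'a', 'b', 'b'],
                   'F2': ['a', 'b', 'a', 'b'],
                   'Label': [10, 20, 100, 200]})

# One-hot encode the categorical features, then fit a CART regressor.
X = pd.get_dummies(df[['F1', 'F2']])
y = df['Label']
reg = DecisionTreeRegressor(random_state=0).fit(X, y)
print(reg.predict(X))  # [10. 20. 100. 200.] on this tiny training set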

How to get variance for regression tree fit?

Suppose the true function is a tree, so that $$y_i=\sum_{j=1}^{J}b_j\, I(x_i \in R_j)+e_i,$$ where $b_j=E(y \mid x \in R_j)$, $E(e_i)=0$, and the $R_j$ are the terminal nodes. Suppose we obtained a fit for this tree via CART and cross-validation: $$\hat{f}(x)=\sum_{j=1}^{\hat{J}}\hat{b}_j\, I(x \in \hat{R}_j),$$ where $\hat{b}_j=\operatorname{avg}(y_i \mid x_i \in \hat{R}_j)$ is the sample average in node $\hat{R}_j$. How could I get the variance of $\hat{f}(x)$, treating $\hat{J}$, $\hat{b}_j$, and $\hat{R}_j$ as random variables?
Category: Data Science
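
One hedged starting point, not a full answer (it conditions on the partition, which the question explicitly treats as random): assuming homoskedastic noise with variance $\sigma^2$, the leaf prediction is a sample mean, so

$$\operatorname{Var}\!\left(\hat{f}(x)\mid \hat{J},\hat{R}_1,\dots,\hat{R}_{\hat{J}}\right)=\frac{\sigma^2}{n_j}\quad\text{for } x\in\hat{R}_j,$$

where $n_j$ is the number of training points falling in $\hat{R}_j$. The unconditional variance then adds the variability of $\hat{J}$ and the $\hat{R}_j$ themselves via the law of total variance; that term generally has no closed form and is typically estimated by bootstrapping the tree-fitting procedure.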

scikit-learn target variable reversed (DecisionTreeClassifier)

I created a DecisionTreeClassifier using sklearn and defined the target variable:

# extract features and target variable
x = df.drop(columns="target_column")
y = df["target_column"]

# save the feature names and target labels
feature_names = x.columns
labels = y.unique()

# split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

Additionally, I checked the count of each of the two classes (Success, Failure) within y, which confirmed that each has the correct count. …
Category: Data Science
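
Since the question is truncated, here is one hedged guess at the cause: sklearn sorts class labels alphabetically, so clf.classes_ is ['Failure', 'Success'] regardless of which class appears first in the data, and the columns of predict_proba (and any class_names passed to plotting helpers) must follow that order or the labels will look swapped. A quick check, reusing x_train and x_test from the snippet above:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)

# classes_ is sorted alphabetically, not by order of appearance:
print(clf.classes_)  # e.g. ['Failure' 'Success']

# Column i of predict_proba corresponds to clf.classes_[i].
print(clf.predict_proba(x_test)[:3])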

List of samples that each tree in a random forest is trained on in Scikit-Learn

In Scikit-learn's random forest, you can set bootstrap=True and each tree would select a subset of samples to train on. Is there a way to see which samples are used in each tree? I went through the documentation about the tree estimators and all the attributes of the trees that are made available by Scikit-learn, but none of them seems to provide what I'm looking for.
Category: Data Science
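
There is no public attribute for this, but one workaround, hedged because it leans on a private implementation detail that may change between sklearn versions, is to replay each tree's stored integer seed, which is how sklearn's internal _generate_sample_indices helper draws the bootstrap rows:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=3, bootstrap=True, random_state=0)
rf.fit(X, y)

# Each fitted tree keeps the integer seed used to draw its bootstrap
# sample; replaying that seed reproduces the sampled row indices
# (drawn with replacement, assuming the default max_samples=None).
n_samples = X.shape[0]
for i, tree in enumerate(rf.estimators_):
    rng = np.random.RandomState(tree.random_state)
    indices = rng.randint(0, n_samples, n_samples)
    print(f"tree {i}: first indices {indices[:5]}")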

Looking for CART/ML model that works with relative data

I am a beginner at AI and ML. I have been given a dataset where I have noticed the columns are relative to one another. So is there any CART or ML model that can work with relative data? For example, a decision tree normally looks like:

if X[0] < 192:
    if X[1] > 24:
        if X[2] < 12:
            ...

I'm looking for a decision tree that works like this:

if X[0] > X[1]:
    if X[1] < X[2]:
        ...

Is there any such machine learning model …
Category: Data Science
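
One common workaround, sketched here as feature engineering rather than a special model: a standard axis-aligned tree can express comparisons like X[0] > X[1] if you add pairwise difference features, since X[0] - X[1] > 0 is an ordinary threshold split.

import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 3)           # hypothetical raw features
y = (X[:, 0] > X[:, 1]).astype(int)  # label defined by a relative rule

# Add one difference column per feature pair; a split on
# "X_i - X_j <= 0" is exactly the relative test "X_i <= X_j".
pairs = list(combinations(range(X.shape[1]), 2))
diffs = np.column_stack([X[:, i] - X[:, j] for i, j in pairs])
X_aug = np.hstack([X, diffs])

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_aug, y)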

How are split decisions for observations (not features) made in decision trees?

I have read a lot of articles about decision trees, and every one of them focused only on how a feature/column is chosen for a split, based on criteria like Gini index, entropy, chi-square, and information gain. But none of them talked about the observation side. Example: let's say I have a dataset with 3 independent features and 1 discrete target variable, namely height_in_cm (like 130, 140), performance_in_class (like below average, average, very good), class (like 7th, 8th or 10th class) and plays_cricket (1 for …
Category: Data Science
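
To make the observation side concrete, here is a hedged sketch of how a split on a numeric column routes rows: CART sorts the values, tries candidate thresholds (typically midpoints between consecutive unique values), scores each by impurity, and then every observation goes left or right by comparing its value to the chosen threshold. The numbers below are made up for illustration.

import numpy as np

def gini(labels):
    # Gini impurity of a label array.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, labels):
    # Scan midpoints between sorted unique values; return the
    # threshold with the lowest weighted child impurity.
    best_t, best_score = None, np.inf
    uniq = np.unique(values)
    for t in (uniq[:-1] + uniq[1:]) / 2:     # candidate midpoints
        left, right = labels[values <= t], labels[values > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

height = np.array([130., 140., 150., 160., 170.])
plays = np.array([0, 0, 1, 1, 1])
t = best_threshold(height, plays)
print(t)  # observations with height <= t go left, the rest go right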

Random selection of variables in each run of Python sklearn decision tree (regression)

When I set random_state=None and run a decision tree for regression in Python sklearn, it picks different variables to build the tree each time. Shouldn't there be only a few top variables used for splitting, giving me similar trees every time? Also, if I use an integer for random_state and run the decision tree, it gives me a different tree for each random_state setting. Which tree should be selected when there are so many trees?
Category: Data Science
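
Part of this behaviour is expected: sklearn's splitter evaluates features in a random order and breaks ties between equally good splits using random_state, so different seeds can legitimately produce different but equally valid trees. Fixing the seed makes a single run reproducible, as in this sketch:

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Same seed, same data -> identical tree structure on refit.
t1 = DecisionTreeRegressor(random_state=0).fit(X, y)
t2 = DecisionTreeRegressor(random_state=0).fit(X, y)
print((t1.tree_.feature == t2.tree_.feature).all())  # True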

Prediction in CART Decision Trees

I was studying the CART algorithm (classification and regression trees), but the formula for the prediction is confusing me. First we have the following definition: let $X:=\{x_1,\dots,x_N\} \subset \mathbb{R}^d$ be a set of data points and let $B(X)$ be the smallest box containing them: $$B(X):=\{z\in \mathbb{R}^d : \min_{x\in X} x_j \leq z_j \leq \max_{x\in X} x_j \ \ \forall j\in [d]\},$$ and let $I$ be the indicator function: $$I[p]=\begin{cases}1 & \text{if } p \text{ holds}\\ 0 & \text{otherwise.}\end{cases}$$ So let's imagine that the CART algorithm has split the …
Category: Data Science
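
For reference, the standard CART predictor, which may be where the truncated derivation is heading, is piecewise constant over the leaf regions:

$$\hat{f}(z)=\sum_{j=1}^{J} c_j\, I[z\in R_j],$$

where $R_1,\dots,R_J$ are the leaves produced by recursively splitting $B(X)$, and $c_j$ is the mean label (regression) or majority class (classification) of the training points in $R_j$.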

What is the difference between a decision tree and something called "subgroup discovery algorithms"?

I'm reading a paper which states that subgroup discovery is: "a data mining technique whose goal is to detect interesting subgroups in a population with respect to a property of interest". The paper goes on to draw distinctions between a decision tree and subgroup discovery, but does not explain the rationale. With a Google search on subgroup discovery algorithms I find the following: The goal of the subgroup discovery algorithm SD, outlined in Figure 1, is …
Category: Data Science

CART classification for imbalanced datasets with R

Hey guys, I need your help for a university project. The main task is to analyze the effects of over-/under-sampling on an imbalanced dataset. But before we can even start with that, our task sheet says that we 1) have to find/create imbalanced datasets and 2) fit those with a binary classification model like CART. So my questions would be: where do I find such imbalanced datasets? And how do I fit those datasets with CART, and what does that …
Category: Data Science
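
The question asks for R, where rpart is the usual CART implementation; as a language-neutral sketch, the same two steps look like this in Python (the 95/5 class ratio is an arbitrary choice):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# 1) Create an imbalanced binary dataset (95% / 5% class ratio).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 2) Fit CART; class_weight="balanced" reweights the minority class
#    so the tree does not simply predict the majority class.
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)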

Make a random forest estimator the exact same of a decision tree

The idea is to make one of the trees of a random forest be built exactly equal to a decision tree. First, we load all libraries, fit a decision tree and plot it.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import random
from pprint import pprint
import pdb

random.seed(0)
np.random.seed(0)

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
dtc = …
Category: Data Science
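
A hedged sketch of the usual trick: with bootstrap=False (every tree sees all rows) and max_features=None (every split considers all features), a one-tree forest removes the randomness that distinguishes it from a plain CART fit, so its predictions should match the standalone decision tree:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

dtc = DecisionTreeClassifier(random_state=0).fit(X, y)

# Disable row bootstrapping and per-split feature subsampling.
rf = RandomForestClassifier(n_estimators=1, bootstrap=False,
                            max_features=None, random_state=0).fit(X, y)

# The forest's single tree is rf.estimators_[0].
print(np.array_equal(dtc.predict(X), rf.predict(X)))  # True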
