Does the decision tree classifier calculate entropies on categorical features before they are transformed with OneHotEncoder, or should the transformation be done first?

I am new to machine learning, and I have almost reached the point of giving up on it, because the online tutorials are pretty confusing as well.

Entropy and Decision trees

One of the confusing tutorials was the following:

Another tutorial was pretty straightforward and comprehensive in terms of how entropies and information gain are calculated, but the instructor didn't split the data.

That instructor started the entropy calculations and ended up with the following tree:

I did understand the calculation; we have been calculating entropies like that since high school.
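To make the hand calculation concrete, it can be reproduced in a few lines of Python. This is only a sketch, and it assumes the classic 14-row play-tennis data (9 'Yes' and 5 'No' labels) that these tutorials usually use:

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a label column, in bits
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

# Hypothetical recreation of the tutorial's play-tennis data (counts only)
df = pd.DataFrame({
    'Outlook': ['Sunny'] * 5 + ['Overcast'] * 4 + ['Rainy'] * 5,
    'Play':    ['No', 'No', 'Yes', 'Yes', 'No',
                'Yes', 'Yes', 'Yes', 'Yes',
                'Yes', 'Yes', 'No', 'Yes', 'No'],
})

parent = entropy(df['Play'])   # about 0.940 bits

# Information gain of splitting on Outlook: parent entropy minus the
# weighted average entropy of the branches
weighted = sum(len(g) / len(df) * entropy(g['Play'])
               for _, g in df.groupby('Outlook'))
print(parent - weighted)       # about 0.247 bits

The confusing part started when I decided to program the same steps in Python on my own data: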

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_excel('data.xlsx')
X = dataset.iloc[:, 0:4]   # the four categorical features: Outlook, Temp, Humidity, Windy
y = dataset.iloc[:, 4]     # the class label in the last column

At this point, I tried to apply DecisionTreeClassifier directly on X and y. And BAM: lots of errors appeared in the console. So I used LabelEncoder and OneHotEncoder:

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
X['Outlook'] = lb.fit_transform(X['Outlook'])
X['Temp'] = lb.fit_transform(X['Temp'])
X['Humidity'] = lb.fit_transform(X['Humidity'])
X['Windy'] = lb.fit_transform(X['Windy'])
y = lb.fit_transform(y)

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

# Keep the encoded result (fit_transform returns a sparse matrix by default)
X_encoded = ohe.fit_transform(X).toarray()
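As a side note, I think the same encoding can be done without applying LabelEncoder to the features at all. This is just a sketch and assumes scikit-learn 0.20 or newer, where OneHotEncoder accepts string categories directly:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ['Outlook', 'Temp', 'Humidity', 'Windy']

# One-hot encode the raw string columns in a single step;
# LabelEncoder is then only needed for the target y
ct = ColumnTransformer([('onehot', OneHotEncoder(), categorical_cols)])
X_encoded_alt = ct.fit_transform(dataset[categorical_cols])

# The categories the encoder found for each column
print(ct.named_transformers_['onehot'].categories_)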

I then figured out that the categorical data should be split into indicator columns using OneHotEncoder. That way, the 4 columns (Outlook, Temp, Humidity, and Windy) will turn into:

Rainy, Overcast, Sunny, Hot, Cold, Mild, High, Normal, False, True

as independent features, and the entropy calculation should then start from these columns. So which approach does the model use to calculate the entropies: the one from the tutorial that confused me, even when OneHotEncoder is used, or is my logic the correct one?
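To show what I mean, here is a quick sketch with pandas get_dummies, just to illustrate the column layout I expect (the exact column names depend on the strings stored in data.xlsx):

import pandas as pd

# Illustrative only: get_dummies expands each categorical column into
# one indicator column per category
X_raw = dataset.iloc[:, 0:4]
X_dummies = pd.get_dummies(X_raw)
print(list(X_dummies.columns))
# e.g. ['Outlook_Overcast', 'Outlook_Rainy', 'Outlook_Sunny',
#       'Temp_Cold', 'Temp_Hot', 'Temp_Mild',
#       'Humidity_High', 'Humidity_Normal',
#       'Windy_False', 'Windy_True']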

Tags: information-theory, decision-trees, python, machine-learning

Category: Data Science


It's not totally clear to me what your main question is, but here are a few points which might help you:

  • Like any regular classifier, a decision tree classifier does not transform the features. It's up to the user to apply any transformation before calling the classifier.
  • Standard decision tree algorithms can deal with categorical features which take multiple values.

As far as I know, training a decision tree model with a library classifier is not going to help you much in understanding how the model is calculated; it just gives you the result. You could try to code your own implementation if you want.
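That said, if you just want to see which splits the library ends up choosing on your one-hot encoded data, something like the sketch below should work (it assumes the X_encoded and y from your code and a reasonably recent scikit-learn):

from sklearn.tree import DecisionTreeClassifier, export_text

# criterion='entropy' makes the tree use information gain like the tutorial
# (the default criterion is 'gini')
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_encoded, y)

# Print the learned splits. With one-hot encoded input, every split is a
# yes/no test on a single dummy column, so the tree will not look exactly
# like the multiway tree drawn by hand in the tutorial.
print(export_text(clf))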
