Does DecisionTreeClassifier calculate entropies before categorical features are transformed with OneHotEncoder, or should the transformation be done first?
I am new to machine learning, and I've almost reached the point of dropping out, as the online tutorials are pretty confusing as well.
Entropy and Decision trees
One of the confusing tutorials went as follows:
Another tutorial was pretty straightforward and comprehensive in terms of how entropies and information gain are calculated, but the instructor didn't split the data.
There, the instructor started the entropy calculations and ended up with the following tree:
I understood the calculations themselves, since we have been computing entropies since high school. The confusing part started when I decided to program the same steps in Python:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('data.xlsx')
X = dataset.iloc[:, 0:4]
y = dataset.iloc[:, 4]
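In case it helps reproduce the problem: assuming data.xlsx holds the classic 14-row play-tennis table the tutorials use, an equivalent DataFrame can be built inline. The exact row values below are my assumption, not read from the file:

```python
import pandas as pd

# Hypothetical reconstruction of the classic play-tennis table;
# the row values are an assumption, not taken from data.xlsx.
dataset = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy",
                 "Overcast", "Sunny", "Sunny", "Rainy", "Sunny", "Overcast",
                 "Overcast", "Rainy"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cold", "Cold", "Cold",
                 "Mild", "Cold", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal",
                 "High"],
    "Windy":    [False, True, False, False, False, True, True,
                 False, False, False, True, True, False, True],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})
X = dataset.iloc[:, 0:4]
y = dataset.iloc[:, 4]
print(X.shape)  # (14, 4)
```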
At this point, I tried to apply DecisionTreeClassifier directly on X and y. And BAM: lots of errors appeared in the console. So I used LabelEncoder and OneHotEncoder:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
# fit_transform refits the encoder each time, so one instance can be
# reused per column, but only the last fit's classes are kept.
X['Outlook'] = lb.fit_transform(X['Outlook'])
X['Temp'] = lb.fit_transform(X['Temp'])
X['Humidity'] = lb.fit_transform(X['Humidity'])
X['Windy'] = lb.fit_transform(X['Windy'])
y = lb.fit_transform(y)

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# The result must be assigned, otherwise the encoded matrix is discarded:
X_encoded = ohe.fit_transform(X).toarray()
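To double-check the whole flow end to end, here is a self-contained sketch that one-hot encodes stand-in data with pandas' get_dummies (a simpler alternative to OneHotEncoder) and trains a DecisionTreeClassifier with the entropy criterion; the data values are my assumption:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Minimal stand-in data with the same four columns (values assumed).
X = pd.DataFrame({
    "Outlook": ["Sunny", "Overcast", "Rainy", "Sunny"],
    "Temp": ["Hot", "Mild", "Cold", "Mild"],
    "Humidity": ["High", "Normal", "High", "Normal"],
    "Windy": [False, True, True, False],
})
y = ["No", "Yes", "Yes", "No"]

# One-hot encode the string columns in one step; Windy is already boolean.
X_encoded = pd.get_dummies(X, columns=["Outlook", "Temp", "Humidity"])

# Encode the target labels as integers.
y_encoded = LabelEncoder().fit_transform(y)

# criterion="entropy" makes the tree use information gain for its splits.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X_encoded, y_encoded)
print(clf.score(X_encoded, y_encoded))  # 1.0 on this tiny training set
```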
I then figured it out: the categorical data should be expanded using OneHotEncoder. That way, the 4 columns (Outlook, Temp, Humidity, and Windy) turn into:
Rainy Overcast Sunny Hot Cold Mild High Normal False True
as independent features, and the entropy calculation should start from that point. So which representation does the model use to calculate the entropies: the one in the tutorial that confused me (even when OneHotEncoder is used), or is my logic the correct one?
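For comparison with the tutorial's hand calculation, the information gain of the Outlook split can be computed directly on the raw categories (a minimal sketch, assuming the standard 14-row play-tennis counts):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Play outcomes grouped by Outlook in the classic 14-row table (assumed).
groups = {
    "Sunny":    ["No", "No", "No", "Yes", "Yes"],
    "Overcast": ["Yes", "Yes", "Yes", "Yes"],
    "Rainy":    ["Yes", "Yes", "Yes", "No", "No"],
}
all_labels = sum(groups.values(), [])
n = len(all_labels)

# Information gain = H(parent) - weighted sum of child entropies.
gain = entropy(all_labels) - sum(
    len(g) / n * entropy(g) for g in groups.values()
)
print(f"{gain:.3f}")  # 0.247
```

This is the calculation the tutorials do by hand on the original categories, before any one-hot encoding.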
Topic information-theory decision-trees python machine-learning
Category Data Science