Frequency Encoding
Frequency encoding uses how often each category occurs as its numeric label. When a category's frequency is related to the target variable, this lets the model weight the category in direct or inverse proportion to how common it is, depending on the nature of the data. There are two variants: replace each category with the count of observations that show it in the dataset, or replace it with the frequency (the fraction or percentage) of observations that show it.
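To make the two variants concrete, here is a minimal pandas-only sketch. The `colors` column is toy data made up for illustration, and this is just one way to do it by hand, not how any particular library implements it.

```python
import pandas as pd

# Toy categorical column (made up for illustration)
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

# Count encoding: each category is replaced by the number of rows showing it
counts = colors.value_counts()
count_encoded = colors.map(counts)

# Frequency encoding: each category is replaced by its share of the rows
freqs = colors.value_counts(normalize=True)
freq_encoded = colors.map(freqs)

print(pd.DataFrame({"color": colors, "count": count_encoded, "frequency": freq_encoded}))
```

Here "red" appears in 3 of 6 rows, so it becomes 3 under count encoding and 0.5 under frequency encoding.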
Implementation of Frequency Encoding:
- Using category_encoders (see zowlex's answer)
- Using feature-engine - CountFrequencyEncoder
feature-engine - CountFrequencyEncoder
Let’s load the data and separate it into train and test sets:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import CountFrequencyEncoder
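# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # Keep only the first character of the cabin (missing values become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    # Cast pclass to object so it is treated as categorical
    data['pclass'] = data['pclass'].astype('O')
    # Assign the result instead of calling fillna(inplace=True) on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()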
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)
Set up the CountFrequencyEncoder() to replace the categories with their frequencies, only in the 3 indicated variables:
# set up the encoder
encoder = CountFrequencyEncoder(
    encoding_method='frequency',
    variables=['cabin', 'pclass', 'embarked'])
# fit the encoder
encoder.fit(X_train)
With fit(), the encoder learns the frequency of each category in the training set and stores them in its encoder_dict_ attribute:
encoder.encoder_dict_
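As a rough picture of what that dictionary holds, the sketch below builds a mapping with the same nested layout (variable, then category, then frequency) by hand. The helper `learn_frequencies` and the `toy_train` frame are hypothetical names used only for illustration; this is not feature-engine's internal code.

```python
import pandas as pd

# Hypothetical miniature training set (not the Titanic data)
toy_train = pd.DataFrame({
    "cabin": ["n", "n", "C", "n"],
    "embarked": ["S", "C", "S", "S"],
})

def learn_frequencies(df, variables):
    """Return {variable: {category: fraction of rows}} for the given columns."""
    return {
        var: df[var].value_counts(normalize=True).to_dict()
        for var in variables
    }

toy_dict = learn_frequencies(toy_train, ["cabin", "embarked"])
print(toy_dict)
```

Each inner dictionary is learned from the training rows only, so its values sum to 1 for that variable.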
We can now go ahead and replace the original categories with their learned frequencies:
# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
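One thing to watch for: the mapping is learned from the training set only, so a category that shows up only in the test set has no entry. The pandas sketch below shows what happens with a hand-built mapping; the toy columns are made up, and filling with 0 is just one possible choice, an assumption on my part. How feature-engine treats such categories depends on the version and its settings, so check its documentation.

```python
import pandas as pd

# Toy train/test columns (made up); "Q" never appears in the training data
train_col = pd.Series(["S", "S", "C", "S"])
test_col = pd.Series(["S", "Q", "C"])

# Learn the frequencies from the training column only
mapping = train_col.value_counts(normalize=True).to_dict()

# Categories missing from the mapping come out as NaN
encoded_test = test_col.map(mapping)

# One option (an assumption, not a rule): treat unseen categories as frequency 0
encoded_test_filled = encoded_test.fillna(0)

print(encoded_test_filled.tolist())
```

After transforming the real data, something like test_t[['cabin', 'pclass', 'embarked']].isna().sum() is a quick way to see whether this happened.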