Frequency/Count encoding

How do I perform frequency/count encoding for a train and test set?

The implementations of this encoding that I've seen simply frequency-encode the categorical variables of a single dataset, with no separate handling of the train and test sets. For instance:

dataset.groupby("cat_column").size()/len(dataset)

In my case, I have a train set and a test set.

[First option] Is it okay to apply frequency encoding to the whole dataset, or would that cause leakage? OR

[Second option] Should I take the independence of the train and test sets into account?

If the second option, how do I do this?

  1. Encode the train set, then use the encoding values of the categories in the train set for the same categories in the test set. Categories that appear in the test set but not in the train set would need to be handled. OR
  2. There's a better generic implementation?
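
To make option 1 concrete, it could look roughly like the sketch below. It assumes pandas DataFrames named `train` and `test` with a column `cat_column` (toy data made up for illustration), and it assigns 0 to categories that never occur in the train set, which is only one possible choice.

```python
import pandas as pd

# Toy data, made up for illustration only
train = pd.DataFrame({"cat_column": ["a", "a", "b", "c"]})
test = pd.DataFrame({"cat_column": ["a", "c", "d"]})  # "d" never occurs in train

# Learn the frequencies from the train set only
freq_map = train["cat_column"].value_counts(normalize=True)

# Apply the same mapping to both sets; unseen categories map to NaN, then to 0
train["cat_column_freq"] = train["cat_column"].map(freq_map)
test["cat_column_freq"] = test["cat_column"].map(freq_map).fillna(0.0)
```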

Topic data-leakage encoding categorical-data machine-learning


Frequency Encoding

Frequency encoding uses the frequency of each category as its numeric label. When the frequency is somewhat related to the target variable, this lets the model pick up that relationship and weight the feature accordingly, whether the relation is direct or inverse. In count encoding, each category is replaced with the number of observations that show that category in the dataset. In frequency encoding, it is replaced with the fraction (or percentage) of observations instead.
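
As a small illustration of the difference between the two variants, here is a sketch in plain pandas. The `colors` column is toy data made up for this example.

```python
import pandas as pd

# Toy column, made up for illustration only
colors = pd.Series(["red", "red", "red", "blue", "green"], name="color")

# Number of observations per category
counts = colors.value_counts()
# Fraction of observations per category
frequencies = colors.value_counts(normalize=True)

# Replace each category with its count or with its frequency
count_encoded = colors.map(counts)
freq_encoded = colors.map(frequencies)
```

Note that "blue" and "green" end up with the same value here: categories that occur equally often become indistinguishable after this encoding.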

Implementation of Frequency Encoding:

  1. Using category_encoders, zowlex's answer
  2. Using feature-engine - CountFrequencyEncoder

feature-engine - CountFrequencyEncoder

Let’s load the data and separate it into train and test sets:

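import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.encoding import CountFrequencyEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # Missing values are stored as '?' in this file
    data = data.replace('?', np.nan)
    # Keep only the first letter of the cabin (missing cabins become 'n', from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    # Cast to object so that pclass is treated as a categorical variable
    data['pclass'] = data['pclass'].astype('O')
    # Fill missing ports of embarkation (plain assignment instead of chained inplace fillna)
    data['embarked'] = data['embarked'].fillna('C')
    return data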

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

Set up the CountFrequencyEncoder() to replace the categories with their frequencies, only in the 3 indicated variables:

# set up the encoder
encoder = CountFrequencyEncoder(encoding_method='frequency',
                                variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train)

With fit(), the encoder learns the frequency of each category from the train set only. These are stored in its encoder_dict_ attribute:

encoder.encoder_dict_
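
For intuition, a dictionary of the same general shape (one inner mapping per variable, from category to frequency) can be built with plain pandas. This is a sketch on made-up toy data, not feature-engine's actual implementation, and `learn_frequencies` is a hypothetical helper name.

```python
import pandas as pd

def learn_frequencies(df, variables):
    """Return {variable: {category: relative frequency}} learned from df."""
    return {
        var: df[var].value_counts(normalize=True).to_dict()
        for var in variables
    }

# Toy training data, made up for illustration only
toy_train = pd.DataFrame({
    "embarked": ["S", "S", "C", "Q"],
    "pclass": [1, 3, 3, 3],
})

toy_dict = learn_frequencies(toy_train, ["embarked", "pclass"])
```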

We can now go ahead and replace the original strings with the numbers:

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
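
Because the frequencies are learned from the train set only, it can be worth checking whether the test set contains categories that the train set never showed, since no frequency was learned for them. A minimal sketch with plain pandas on made-up toy data follows; `unseen_categories` is a hypothetical helper, not part of feature-engine.

```python
import pandas as pd

def unseen_categories(train_df, test_df, variables):
    """Return {variable: categories present in test_df but absent from train_df}."""
    return {
        var: set(test_df[var].dropna().unique()) - set(train_df[var].dropna().unique())
        for var in variables
    }

# Toy data, made up for illustration only
toy_train = pd.DataFrame({"cabin": ["C", "B", "n"], "embarked": ["S", "C", "S"]})
toy_test = pd.DataFrame({"cabin": ["C", "T"], "embarked": ["S", "Q"]})

missing = unseen_categories(toy_train, toy_test, ["cabin", "embarked"])
```

How such categories should be encoded (for example as 0, as NaN, or by raising an error) is a modelling decision that depends on the data.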

I hope this answer is not too late, but you can also use the category_encoders library, which follows sklearn's style.

Example:

import category_encoders as ce

# I'll pretend that you've already split your data into train/test

# your categorical features
cat_features = ['cat_feature1', 'cat_feature2']

# count encoder, fitted on the train set only
count_encoder = ce.CountEncoder(cols=cat_features)
count_encoder.fit(train[cat_features])

# add the encoded columns next to the originals, with a '_count' suffix
train = train.join(count_encoder.transform(train[cat_features]).add_suffix('_count'))
test = test.join(count_encoder.transform(test[cat_features]).add_suffix('_count'))
