Using a Subset of Categories in a Categorical Column
I have an XGBoost model that I'm going to retrain after adding new features. One column in my data holds the customers' professions and has 60 categories. I suppose there is no need to convert them to dummy variables because tree-based models can handle them, but I thought that handling so many categories would require many splits, so I decided to keep a subset of the categories and group the rest under a single category.

To decide which categories to keep, I one-hot encoded the column and ran a chi-square test with scipy.stats.chi2_contingency between each generated dummy column and the target variable. Then I sorted the columns by test statistic in ascending order and picked the first 10. Finally, in the original column I kept the values that are in that subset and assigned all other values to the same category. Is this a proper method, or is there any inconsistency in it? Any suggestions?
My code is as below:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def get_n_cats(df, col_to_cat, n, target="TARGET_GPL_SATIS"):
    # Apply one-hot encoding (NaN values get no dummy column)
    dummies = pd.get_dummies(df[col_to_cat])

    # Calculate chi-square statistic for each dummy column
    scores = pd.DataFrame(index=dummies.columns, columns=["Score"], dtype=float)
    for col in dummies.columns:
        cont_table = contingency_table(dummies[col], df[target])
        score = chi2_contingency(cont_table)[0]
        scores.loc[col, "Score"] = score

    # Sort by score (ascending) and get first n columns
    scores.sort_values(by="Score", ascending=True, inplace=True)
    first_n_columns = scores.index[:n]

    # Create mapping dict
    mapping = create_mapping(first_n_columns)

    # Preserve only first n columns as categories.
    # Missing values are detected with pd.isna, because NaN != NaN makes a
    # plain dict membership test unreliable.
    return [
        mapping[np.nan] if pd.isna(val) else mapping.get(val, mapping["other"])
        for val in df[col_to_cat]
    ]


def contingency_table(c1, c2):
    """Calculates contingency table between provided columns."""
    df = pd.DataFrame({"c1": c1, "c2": c2})
    return df.groupby(["c1", "c2"]).size().unstack(fill_value=0).values


def create_mapping(first_n_columns):
    """Creates mapping dict: 0 for NaN, 1..n for kept categories, n+1 for the rest."""
    mapping = {cat: code + 1 for code, cat in enumerate(first_n_columns)}
    mapping[np.nan] = 0
    mapping["other"] = len(first_n_columns) + 1
    return mapping
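
For reference, the same procedure can be sketched more compactly with pd.crosstab, which makes it easier to inspect the per-category statistics before collapsing. This is only a rough sketch under assumptions: the function names chi2_per_category and collapse_categories, the ascending parameter, and the toy data are all made up for illustration and are not part of my real pipeline. One thing it makes visible is that a larger chi-square statistic indicates a stronger association with the target, so sorting in ascending order keeps the categories with the weakest association.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_per_category(df, col, target):
    """Chi-square statistic of (category == c) vs. target, for each category c."""
    scores = {}
    for cat in df[col].dropna().unique():
        # 2 x k table: rows are "is this category" / "is not", columns are target values
        table = pd.crosstab(df[col] == cat, df[target])
        scores[cat] = chi2_contingency(table)[0]
    return pd.Series(scores, dtype=float)


def collapse_categories(df, col, target, n, ascending=True):
    """Keep n categories ranked by chi-square statistic; map the rest to 'other'."""
    scores = chi2_per_category(df, col, target).sort_values(ascending=ascending)
    keep = set(scores.index[:n])
    # Kept categories and missing values stay as they are; everything else is grouped
    collapsed = df[col].where(df[col].isin(keep) | df[col].isna(), "other")
    return collapsed, scores


# Hypothetical toy data: "a" is mostly positive, "c" mostly negative, "b" balanced
toy = pd.DataFrame({
    "profession": ["a"] * 40 + ["b"] * 40 + ["c"] * 40 + [np.nan] * 4,
    "target": [1] * 36 + [0] * 4 + [1] * 20 + [0] * 20 + [0] * 38 + [1] * 2 + [0, 1, 0, 1],
})

collapsed, scores = collapse_categories(toy, "profession", "target", n=1, ascending=False)
print(scores)
print(collapsed.value_counts(dropna=False))
```

On this toy data, ascending=False keeps the category that differs most from the rest, while ascending=True keeps the balanced one, which is the behaviour of my code above.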
Topic chi-square-test one-hot-encoding xgboost categorical-data
Category Data Science