Using a Subset of Categories in a Categorical Column

I have an XGBoost model that I'm going to retrain with new features. One column in my data holds the customers' professions and has 60 categories. I assume there is no need to convert them to dummy variables, since tree-based models can handle categorical columns, but I expect a column with that many categories to require many splits. So I decided to keep a subset of the categories and group the rest under a single category.

To decide which categories to keep, I one-hot encoded the column and ran a chi-square test with scipy.stats.chi2_contingency between each generated dummy column and the target variable. I then sorted the dummy columns by test statistic in ascending order and picked the first 10. Finally, in the original column I kept the values that are in that subset and assigned one shared category to all the others.

Is this a proper method, or is there an inconsistency in it? Any suggestions?

My code is as below:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def get_n_cats(df, col_to_cat, n, target="TARGET_GPL_SATIS"):
    # Apply one-hot encoding
    dummies = pd.get_dummies(df[col_to_cat])

    # Calculate chi-square statistic for each dummy column
    scores = pd.DataFrame(index=dummies.columns, columns=["Score"])
    for col in dummies.columns:
        cont_table = contingency_table(dummies[col], df[target])
        score = chi2_contingency(cont_table)[0]
        scores.loc[col,"Score"] = score

    # Sort by score and get first n columns
    scores.sort_values(by="Score", ascending=True, inplace=True)
    first_n_columns = scores.index[:n]

    # Create mapping dict
    mapping = create_mapping(first_n_columns)

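    # Preserve only first n columns as categories.
    # Missing values are detected with pd.isna, because `nan in dict` relies on
    # object identity and can send NaN values to "other" instead of code 0.
    return [0 if pd.isna(val) else mapping.get(val, mapping["other"]) for val in df[col_to_cat]]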

def contingency_table(c1, c2):
    """Calculates contingency table between provided columns."""
    df = pd.DataFrame({"c1":c1, "c2":c2})
    return df.groupby(['c1','c2']).size().unstack(fill_value=0).values

def create_mapping(first_n_columns):
    """Creates mapping dict."""
    mapping = {cat:code+1 for code, cat in enumerate(first_n_columns)}
    mapping[np.nan] = 0
    mapping["other"] = len(first_n_columns) + 1
    return mapping
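
For reference, the selection steps described above can be sketched more compactly with pd.crosstab. This is only a minimal sketch, not a replacement for the code above: the names rank_categories and keep_top_n and the toy data are hypothetical. One assumption differs from the code above: the ranking is sorted in descending order, because a larger chi-square statistic indicates a stronger association with the target, so an ascending sort would keep the categories least related to it.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def rank_categories(df, col, target):
    """Return one chi-square statistic per category (one-vs-rest), largest first."""
    stats = {}
    for cat in df[col].dropna().unique():
        # 2x2 table: "row belongs to this category" vs. target value
        table = pd.crosstab(df[col] == cat, df[target])
        stats[cat] = chi2_contingency(table)[0]
    return pd.Series(stats, dtype=float).sort_values(ascending=False)


def keep_top_n(df, col, target, n, other="other"):
    """Keep the n highest-ranked categories and collapse the rest into `other`."""
    top = rank_categories(df, col, target).index[:n]
    # Series.where keeps values where the condition holds; NaN stays NaN
    return df[col].where(df[col].isin(top) | df[col].isna(), other)


# Hypothetical toy data, only to show the behaviour
rng = np.random.default_rng(0)
toy = pd.DataFrame({"profession": rng.choice(list("ABCDEF"), size=600)})
# Category "A" is strongly related to the target, the others are not
toy["target"] = np.where(
    toy["profession"] == "A",
    rng.random(600) < 0.9,
    rng.random(600) < 0.3,
).astype(int)

ranking = rank_categories(toy, "profession", "target")
grouped = keep_top_n(toy, "profession", "target", n=2)
print(ranking)
print(grouped.value_counts())
```

chi2_contingency also returns the p-value as the second element of its result, which could be used for the ranking instead of the raw statistic.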

