Dealing with observations with an arbitrary number of categories, each with an arbitrary number of values

Suppose we have a set of elements $X = \{x_1, x_2, ..., x_n\}$. Each element is characterised by a set of features. The features characterising a particular element $x_i$ can belong to one of $q$ different categories. Each category $f_j$ ($j = 1, ..., q$) takes a value from its own set of possible values $V_j = \{v_{j_1}, v_{j_2}, ...\}$. So an observation $x_i$ may be described as $x_i = \{f_1 = v_{1_a}, f_2 = v_{2_b}, ..., f_q = v_{q_c}\}$. In …
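A minimal sketch of the setup described above, with hypothetical category names and values (none of these identifiers come from the question itself): each observation is a mapping from a category to one of its allowed values.

```python
# Hypothetical illustration: each element x_i is a dict {category: value},
# where every category f_j draws its value from its own set V_j.
V = {
    "color": {"red", "green", "blue"},   # V_color
    "size": {"S", "M", "L"},             # V_size
    "shape": {"round", "square"},        # V_shape
}

x1 = {"color": "red", "size": "M"}       # not every category must appear
x2 = {"color": "blue", "shape": "round"}

def is_valid(x, V):
    """Check that every feature uses a known category and an allowed value."""
    return all(cat in V and val in V[cat] for cat, val in x.items())

print(is_valid(x1, V), is_valid(x2, V))  # True True
```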
Category: Data Science

Does it make sense to use target encoding together with tree-based models?

I'm working on a regression problem with a few high-cardinality categorical features (forecasting different items with a single model). Someone suggested using target encoding (the mean/median of the target for each item) together with xgboost. While I understand how this new feature would improve a linear model (or GLMs in general), I do not understand how this approach fits into a tree-based model (regression trees, random forests, boosting). Given the feature is used for splitting, items with a mean below …
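For concreteness, a small sketch (with made-up item ids and target values, not from the question) of the mean target encoding the question describes; a tree can then split on the encoded column, grouping items with similar means into the same leaf, which is why a single numeric feature can stand in for a high-cardinality id.

```python
import pandas as pd

# Toy data (hypothetical): forecasting a numeric target for many items.
df = pd.DataFrame({
    "item": ["A", "A", "B", "B", "C"],
    "y":    [10.0, 12.0, 3.0, 5.0, 8.0],
})

# Target (mean) encoding: replace each item id with the mean target of that item.
means = df.groupby("item")["y"].mean()   # A -> 11.0, B -> 4.0, C -> 8.0
df["item_te"] = df["item"].map(means)
print(df)
```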
Category: Data Science

Is normalization needed for TargetEncoded Variables?

Basically the title. If I encode people's addresses (the cities they live in) with a target encoder, do I still need to normalize that column? Of course, the capital is going to have more citizens, and so do the bigger cities, so the distribution looks roughly exponential. In such a case, is normalization still needed (via a log transform, for example), or are target-encoded variables enough? Why? Thank you!
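As a small illustration of the log transform the question mentions (the values below are hypothetical): note that tree-based models are invariant to monotonic transforms such as the log, so this mainly matters for scale-sensitive models like linear models or neural networks.

```python
import numpy as np

# Hypothetical target-encoded "city" column with a skewed, exponential-like spread.
city_te = np.array([120000.0, 95000.0, 4000.0, 2500.0, 800.0])

# log1p compresses the long right tail while preserving the ordering of values.
city_te_log = np.log1p(city_te)
print(city_te_log.round(2))
```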
Category: Data Science

Target encoding with KFold cross-validation - how to transform test set?

Let's say I have a categorical feature (cat):

```python
import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})
```

and I want to use target encoding with regularisation using CV like below:

```python
X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
df_train = pd.concat([X_train, y_train], axis=1).sort_index()
df_train["kfold"] = -1
idx = df_train.index
df_train = df_train.sample(frac=1)
skf = StratifiedKFold(n_splits=5)
for fold_id, …
```
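One common answer to the test-set half of this question, sketched below under the same toy setup as the question (the out-of-fold loop here is my own minimal version, not the question's truncated code): the training rows get out-of-fold means, while the test set is transformed once using means computed from the full training set.

```python
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame({"y": y, "cat": cat})

X_train, X_test, y_train, y_test = train_test_split(
    df[["cat"]], df["y"], train_size=0.8, random_state=42
)

# Out-of-fold encoding for the training set: each row is encoded with category
# means computed on the OTHER folds only, never on its own fold.
train = pd.concat([X_train, y_train], axis=1)
train["cat_te"] = np.nan
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(train, train["y"]):
    fold_means = train.iloc[tr_idx].groupby("cat")["y"].mean()
    train.iloc[val_idx, train.columns.get_loc("cat_te")] = (
        train["cat"].iloc[val_idx].map(fold_means).to_numpy()
    )

# The test set is encoded once, with means from the FULL training set.
full_means = train.groupby("cat")["y"].mean()
X_test = X_test.assign(cat_te=X_test["cat"].map(full_means))
print(X_test.head())
```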
Category: Data Science

Predict apartment prices with two sources of prices

I am asking for help with the following problem. There are two subsamples in the dataset: one where the target is real (valid), and one where it is approximate (I do not yet know how it differs; one sample has the real price of an apartment, the other has the price from ads, and the goal, of course, is to predict the real one). Any ideas about what to do about this? I have two ideas - to normalize the …
Category: Data Science

Logistic Regression Multi-level Independent variables

I'm trying to study logistic regression. When I fit the target variable against all the features, I got the summary showing the p-values as usual, but one of the features has 60 levels and another has 13 levels. How can I proceed with this kind of data, given that some of these levels have significantly low p-values but others don't, so I can't just drop the feature completely? Below is a sample of the summary; please advise. Coefficients: …
Category: Data Science

One-Hot-Encoding Target variable

I have a dataset whose target variable consists of 4 classes. I performed ordinal encoding on it, which worked for me, but my question is whether applying one-hot encoding could also solve this problem, as it would generate 4 new columns from the single target variable:

|classes|classes_a|classes_b|classes_c|classes_d|
|-------|---------|---------|---------|---------|
|a      |1        |0        |0        |0        |
|b      |0        |1        |0        |0        |
|c      |0        |0        |1        |0        |
|d      |0        |0        |0        |1        |

…
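For reference, a sketch of both representations (with hypothetical labels): `pd.get_dummies` produces the one-hot table shown above, while most sklearn classifiers instead expect a single 1-D label vector such as the one `LabelEncoder` produces; one-hot targets are mainly needed for neural networks trained with categorical cross-entropy.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

classes = pd.Series(["a", "b", "c", "d", "a"], name="classes")

# One-hot encoding of the target: one indicator column per class.
onehot = pd.get_dummies(classes, prefix="classes")
print(onehot)

# The 1-D integer-label alternative most sklearn estimators expect:
y = LabelEncoder().fit_transform(classes)
print(y)  # [0 1 2 3 0]
```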
Category: Data Science

target encoding with multiple columns

I'm attempting to do target encoding with multiple columns of a dataframe and I'm getting an error message I don't understand. Here is a fragment of the code:

```python
X['District Code Encoded'] = encoder.fit_transform(X['District Code'], y)
X['Property id Encoded'] = encoder.fit_transform(X['Property id'], y)
X['Property name Encoded'] = encoder.fit_transform(X['Property name'], y)
```

It always runs the first line and then throws a KeyError on the second line, naming the key that occurs in the second pair of square brackets …
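One library-agnostic workaround, sketched below with made-up data (this is plain pandas, not the category_encoders API): compute each column's encoding independently so no encoder state is shared between columns; if you do use a library encoder, fitting a fresh encoder instance per column avoids any column-name state left over from an earlier fit.

```python
import pandas as pd

# Hypothetical data standing in for the question's dataframe.
X = pd.DataFrame({
    "District Code": ["D1", "D1", "D2", "D2"],
    "Property id":   ["P1", "P2", "P3", "P3"],
})
y = pd.Series([100.0, 120.0, 80.0, 90.0])

# Plain-pandas target encoding, one column at a time: no shared encoder state.
for col in ["District Code", "Property id"]:
    means = y.groupby(X[col]).mean()          # mean target per category
    X[f"{col} Encoded"] = X[col].map(means)
print(X)
```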
Category: Data Science

Why is my validation score so much higher using TargetEncoder?

So I'm experimenting a bit with an XGBoost model, encoding the categorical variables using the target encoder from the category_encoders library. The code below shows how I split the dataset and fit the target encoder:

```python
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=70)
ce_enc = ce.TargetEncoder()
X_train[encode_name_lst] = ce_enc.fit_transform(X_train[encode_name_lst], y_train)
X_test[encode_name_lst] = ce_enc.transform(X_test[encode_name_lst])
```

Now when I start training on the dataset using cross-validation, I see very good scores on the validation set (an AUC of ~0.92). But …
Category: Data Science

Categories with the same mean in target encoding

While doing target encoding it can happen that two categories have the same target mean. This is bad because the new feature will then not distinguish between them, so we lose some information. It is also potentially harmful to the model: a split chosen on this feature can produce some inconsistencies. Is there any way to fix this problem?
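One commonly used mitigation, sketched below with made-up data (the smoothing formula and the choice of `m` are illustrative assumptions, not from the question): smoothed target encoding blends each category mean with the global mean, weighted by category size, so two categories with equal raw means but different counts map to different values. Adding the category count as a separate feature is another option.

```python
import pandas as pd

# Hypothetical data: categories A and B share the raw target mean 0.5.
df = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "y":   [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

stats = df.groupby("cat")["y"].agg(["mean", "count"])

# Smoothed encoding: (n * cat_mean + m * global_mean) / (n + m).
m = 5.0                       # smoothing strength (hyperparameter)
global_mean = df["y"].mean()  # 0.3 here
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
print(smoothed)  # A and B now receive different encoded values
```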
Category: Data Science

why should i do target encoding within cv loop?

I wish to use target encoding, using the sklearn-compatible category_encoders library. I don't really understand why it is necessary to include this as a step in an sklearn pipeline WITHIN the cross-validation loop, as, e.g., this example does: Target encoding with KFold cross-validation - how to transform test set? My methodology is similar to the one in the link, except I do not use any smoothing. My dataset is quite large, around 300-500k rows. However, looking at my …
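The usual reason is leakage: if the encoder is fitted on the full training set before cross-validation, every validation row's own target has already contributed to its encoded value. A sketch of the leakage-free setup (this uses a minimal hand-rolled encoder and random data for self-containedness, not the category_encoders API): with the encoder inside the pipeline, `cross_val_score` refits it on each fold's training part only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Minimal target encoder: category -> mean of y, fit on training folds only."""
    def fit(self, X, y):
        keys = pd.Series(np.asarray(X.iloc[:, 0]))
        self.mapping_ = pd.Series(np.asarray(y)).groupby(keys).mean()
        self.global_mean_ = float(np.mean(y))  # fallback for unseen categories
        return self
    def transform(self, X):
        enc = X.iloc[:, 0].map(self.mapping_).fillna(self.global_mean_)
        return enc.to_frame().to_numpy()

rng = np.random.default_rng(0)
X = pd.DataFrame({"cat": rng.choice(list("ABCDE"), size=200)})
y = rng.integers(0, 2, size=200)

# The encoder sits inside the pipeline, so each CV fold refits it on that
# fold's training part; the validation scores are therefore leakage-free.
pipe = make_pipeline(MeanTargetEncoder(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```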
Category: Data Science

Interpreting decision tree results after target encoding

I am not sure how to interpret the results of my decision tree after using target encoding; could someone clarify? The example below doesn't actually need target encoding, it's just to explain my confusion. For instance, I am trying to classify whether a fruit is rotten or not given its age and fruit type. I use target encoding for the fruit column. I then get the following decision tree with the default sklearn decision tree classifier parameters. I believe …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.