Dealing with observations with an arbitrary number of categories, each with an arbitrary number of values

Suppose we have a set of elements $X = \{x_1, x_2, ..., x_n\}$. Each element is characterised by a set of features. The features characterising a particular element $x_i$ can belong to one of $q$ different categories. Each category $f_j$ ($j = 1, ..., q$) takes a value from its own set of possible values $V_j = \{v_{j_1}, v_{j_2}, ...\}$. So an observation $x_i$ may be described as $x_i = \{f_1 = v_{1_a}, f_2 = v_{2_b}, ..., f_q = v_{q_c}\}$. In …
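A minimal sketch of the setup described above, with hypothetical category names and values (none of these identifiers come from the question itself): each observation is a mapping from a category to one of its allowed values.

```python
# Hypothetical illustration: each element x_i is a dict {category: value},
# where every category f_j draws its value from its own set V_j.
V = {
    "color": {"red", "green", "blue"},   # V_color
    "size": {"S", "M", "L"},             # V_size
    "shape": {"round", "square"},        # V_shape
}

x1 = {"color": "red", "size": "M"}       # not every category must appear
x2 = {"color": "blue", "shape": "round"}

def is_valid(x, V):
    """Check that every feature uses a known category and an allowed value."""
    return all(cat in V and val in V[cat] for cat, val in x.items())

print(is_valid(x1, V), is_valid(x2, V))  # True True
```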
Category: Data Science

Does it make sense to use target encoding together with tree-based models?

I'm working on a regression problem with a few high-cardinality categorical features (forecasting different items with a single model). Someone suggested using target encoding (the mean/median of the target for each item) together with xgboost. While I understand how this new feature would improve a linear model (or GLMs in general), I do not understand how this approach fits into a tree-based model (regression trees, random forests, boosting). Given the feature is used for splitting, items with a mean below …
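For concreteness, a small sketch (with made-up item ids and target values, not from the question) of the mean target encoding the question describes; a tree can then split on the encoded column, grouping items with similar means into the same leaf, which is why a single numeric feature can stand in for a high-cardinality id.

```python
import pandas as pd

# Toy data (hypothetical): forecasting a numeric target for many items.
df = pd.DataFrame({
    "item": ["A", "A", "B", "B", "C"],
    "y":    [10.0, 12.0, 3.0, 5.0, 8.0],
})

# Target (mean) encoding: replace each item id with the mean target of that item.
means = df.groupby("item")["y"].mean()   # A -> 11.0, B -> 4.0, C -> 8.0
df["item_te"] = df["item"].map(means)
print(df)
```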
Category: Data Science

Is normalization needed for TargetEncoded Variables?

Basically the title. If I encode people's addresses (the cities they live in) with a target encoder, do I still need to normalize that column? Of course, the capital is going to have more citizens, and so do the bigger cities, so the distribution looks roughly exponential. In such a case, is normalization still needed (via a log transform, for example), or are target-encoded variables enough? Why? Thank you!
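As a small illustration of the log transform the question mentions (the values below are hypothetical): note that tree-based models are invariant to monotonic transforms such as the log, so this mainly matters for scale-sensitive models like linear models or neural networks.

```python
import numpy as np

# Hypothetical target-encoded "city" column with a skewed, exponential-like spread.
city_te = np.array([120000.0, 95000.0, 4000.0, 2500.0, 800.0])

# log1p compresses the long right tail while preserving the ordering of values.
city_te_log = np.log1p(city_te)
print(city_te_log.round(2))
```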
Category: Data Science

Target encoding with KFold cross-validation - how to transform test set?

Let's say I have a categorical feature (cat):

```python
import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})
```

and I want to use target encoding with regularisation using CV like below:

```python
X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
df_train = pd.concat([X_train, y_train], axis=1).sort_index()
df_train["kfold"] = -1
idx = df_train.index
df_train = df_train.sample(frac=1)
skf = StratifiedKFold(n_splits=5)
for fold_id, …
```
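One common answer to the test-set half of this question, sketched below under the same toy setup as the question (the out-of-fold loop here is my own minimal version, not the question's truncated code): the training rows get out-of-fold means, while the test set is transformed once using means computed from the full training set.

```python
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame({"y": y, "cat": cat})

X_train, X_test, y_train, y_test = train_test_split(
    df[["cat"]], df["y"], train_size=0.8, random_state=42
)

# Out-of-fold encoding for the training set: each row is encoded with category
# means computed on the OTHER folds only, never on its own fold.
train = pd.concat([X_train, y_train], axis=1)
train["cat_te"] = np.nan
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(train, train["y"]):
    fold_means = train.iloc[tr_idx].groupby("cat")["y"].mean()
    train.iloc[val_idx, train.columns.get_loc("cat_te")] = (
        train["cat"].iloc[val_idx].map(fold_means).to_numpy()
    )

# The test set is encoded once, with means from the FULL training set.
full_means = train.groupby("cat")["y"].mean()
X_test = X_test.assign(cat_te=X_test["cat"].map(full_means))
print(X_test.head())
```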
Category: Data Science

Predict apartment prices with two sources of prices

I am asking for help with the following problem. There are two subsamples in the dataset: one where the target is real (valid), and one where it is approximate (I do not yet know how it differs; one sample has the real price of an apartment, the other has the price from ads, and the goal, of course, is to predict the real one). Any ideas about what to do about this? I have two ideas - to normalize the …
Category: Data Science

Logistic Regression Multi-level Independent variables

I'm trying to study logistic regression. When I fit the target variable against all the features, I got the summary showing the p-values as usual, but one of the features has 60 levels and another has 13 levels. How can I proceed with this kind of data, given that some of these levels have significantly low p-values but others don't, so I can't just drop the feature completely? Below is a sample of the summary; please advise. Coefficients: …
Category: Data Science

One-Hot-Encoding Target variable

I have a dataset whose target variable consists of 4 classes. I performed ordinal encoding on it, which worked for me, but my question is whether applying one-hot encoding could also solve this problem, as it would generate 4 new columns from the single target variable:

|classes|classes_a|classes_b|classes_c|classes_d|
|-------|---------|---------|---------|---------|
|a      |1        |0        |0        |0        |
|b      |0        |1        |0        |0        |
|c      |0        |0        |1        |0        |
|d      |0        |0        |0        |1        |

…
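For reference, a sketch of both representations (with hypothetical labels): `pd.get_dummies` produces the one-hot table shown above, while most sklearn classifiers instead expect a single 1-D label vector such as the one `LabelEncoder` produces; one-hot targets are mainly needed for neural networks trained with categorical cross-entropy.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

classes = pd.Series(["a", "b", "c", "d", "a"], name="classes")

# One-hot encoding of the target: one indicator column per class.
onehot = pd.get_dummies(classes, prefix="classes")
print(onehot)

# The 1-D integer-label alternative most sklearn estimators expect:
y = LabelEncoder().fit_transform(classes)
print(y)  # [0 1 2 3 0]
```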
Category: Data Science

target encoding with multiple columns

I'm attempting to do target encoding with multiple columns of a dataframe and I'm getting an error message I don't understand. Here is a fragment of the code:

```python
X['District Code Encoded'] = encoder.fit_transform(X['District Code'], y)
X['Property id Encoded'] = encoder.fit_transform(X['Property id'], y)
X['Property name Encoded'] = encoder.fit_transform(X['Property name'], y)
```

It always runs the first line and then throws a KeyError on the second line, naming the key that occurs in the second pair of square brackets …
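One library-agnostic workaround, sketched below with made-up data (this is plain pandas, not the category_encoders API): compute each column's encoding independently so no encoder state is shared between columns; if you do use a library encoder, fitting a fresh encoder instance per column avoids any column-name state left over from an earlier fit.

```python
import pandas as pd

# Hypothetical data standing in for the question's dataframe.
X = pd.DataFrame({
    "District Code": ["D1", "D1", "D2", "D2"],
    "Property id":   ["P1", "P2", "P3", "P3"],
})
y = pd.Series([100.0, 120.0, 80.0, 90.0])

# Plain-pandas target encoding, one column at a time: no shared encoder state.
for col in ["District Code", "Property id"]:
    means = y.groupby(X[col]).mean()          # mean target per category
    X[f"{col} Encoded"] = X[col].map(means)
print(X)
```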
Category: Data Science

Why is my validation score so much higher using TargetEncoder?

So I'm experimenting a bit with an XGBoost model, encoding the categorical variables using the target encoder from the category_encoders library. The code below shows how I split the dataset and fit the target encoder:

```python
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=70)
ce_enc = ce.TargetEncoder()
X_train[encode_name_lst] = ce_enc.fit_transform(X_train[encode_name_lst], y_train)
X_test[encode_name_lst] = ce_enc.transform(X_test[encode_name_lst])
```

Now when I start training on the dataset using cross-validation, I see very good scores on the validation set (an AUC of ~0.92). But …
Category: Data Science

Categories with the same mean in target encoding

While doing target encoding it can happen that two categories have the same target mean. This is bad because the new feature will then not distinguish between them, so we lose some information. It is also potentially harmful to the model: a split chosen on this feature can produce some inconsistencies. Is there any way to fix this problem?
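One commonly used mitigation, sketched below with made-up data (the smoothing formula and the choice of `m` are illustrative assumptions, not from the question): smoothed target encoding blends each category mean with the global mean, weighted by category size, so two categories with equal raw means but different counts map to different values. Adding the category count as a separate feature is another option.

```python
import pandas as pd

# Hypothetical data: categories A and B share the raw target mean 0.5.
df = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "y":   [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

stats = df.groupby("cat")["y"].agg(["mean", "count"])

# Smoothed encoding: (n * cat_mean + m * global_mean) / (n + m).
m = 5.0                       # smoothing strength (hyperparameter)
global_mean = df["y"].mean()  # 0.3 here
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
print(smoothed)  # A and B now receive different encoded values
```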
Category: Data Science

why should i do target encoding within cv loop?

I wish to use target encoding, using the sklearn-compatible category_encoders library. I don't really understand why it is necessary to include this as a step in an sklearn pipeline WITHIN the cross-validation loop, as, e.g., this example does: Target encoding with KFold cross-validation - how to transform test set? My methodology is similar to the one in the link, except I do not use any smoothing. My dataset is quite large, around 300-500k rows. However, looking at my …
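The usual reason is leakage: if the encoder is fitted on the full training set before cross-validation, every validation row's own target has already contributed to its encoded value. A sketch of the leakage-free setup (this uses a minimal hand-rolled encoder and random data for self-containedness, not the category_encoders API): with the encoder inside the pipeline, `cross_val_score` refits it on each fold's training part only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Minimal target encoder: category -> mean of y, fit on training folds only."""
    def fit(self, X, y):
        keys = pd.Series(np.asarray(X.iloc[:, 0]))
        self.mapping_ = pd.Series(np.asarray(y)).groupby(keys).mean()
        self.global_mean_ = float(np.mean(y))  # fallback for unseen categories
        return self
    def transform(self, X):
        enc = X.iloc[:, 0].map(self.mapping_).fillna(self.global_mean_)
        return enc.to_frame().to_numpy()

rng = np.random.default_rng(0)
X = pd.DataFrame({"cat": rng.choice(list("ABCDE"), size=200)})
y = rng.integers(0, 2, size=200)

# The encoder sits inside the pipeline, so each CV fold refits it on that
# fold's training part; the validation scores are therefore leakage-free.
pipe = make_pipeline(MeanTargetEncoder(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```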
Category: Data Science

Interpreting decision tree results after target encoding

I am not sure how to interpret the results of my decision tree after using target encoding; could someone clarify? The example below doesn't actually need target encoding, it's just to explain my confusion. For instance, I am trying to classify whether a fruit is rotten or not given its age and fruit type. I use target encoding for the fruit column. I then get the following decision tree with the default sklearn decision tree classifier parameters. I believe …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.