categorical-encoding

Turning multiple binary columns into categorical (with fewer columns) with Python Pandas

Legna

June 3, 2022 22:24

I want to turn these categories into values of categorical columns. The values in each category are the binary columns currently present in the data frame. A11, A12, ... are details of A1, so A11 == 1 necessarily implies A1 == 1, but the converse does not hold. The following conditions apply: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be 'A11' and 'A1' is ignored …

Topic: categorical-encoding dataframe pandas python data-cleaning

Category: Data Science

Handling encoding of a dataset with more than 2000 columns in total

Sahil

June 3, 2022 11:09

Whenever a dataset has to be preprocessed before being fed to a model, we convert the categorical values to numerical values, generally with techniques such as label encoding or one-hot encoding, but all of this is done manually, column by column. What if the dataset is huge in terms of columns (e.g., 2000 columns)? Going through each column manually is then not feasible, so how do we handle encoding in such cases? Are there any …

Topic: categorical-encoding encoding

Category: Data Science
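For the many-columns question above, one possible approach is to let scikit-learn pick the columns by dtype instead of naming them. The sketch below is only an illustration: the toy frame and its column names are made up, and it assumes every non-numeric column should be one-hot encoded, which may not hold for a real 2000-column dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for a wide dataset (hypothetical column names).
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
    "price": [1.0, 2.5, 3.0],
})

# Select every non-numeric column automatically instead of listing columns by hand.
encoder = ColumnTransformer(
    transformers=[
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore"),
            make_column_selector(dtype_exclude="number"),
        ),
    ],
    remainder="passthrough",  # numeric columns are passed through unchanged
)

encoded = encoder.fit_transform(df)
print(encoded.shape)
```

With very high-cardinality columns the one-hot output can become extremely wide, so a different encoder may be a better fit for some of the selected columns.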

Dealing with observations with an arbitrary number of categories with an arbitrary number of values

King Powa

May 26, 2022 17:21

Suppose we have a set of elements $X = \{x_1, x_2, ..., x_n\}$. Each element is characterised by a set of features. The features characterising a particular element $x_i$ can belong to one of $q$ different categories. Each different category $f_q$ can have a different value $v_{q_i}$, belonging to a set of possible values $V_q = \{v_{q_1}, v_{q_2} ...\}$. So, an observation $x_i$ may be described as $x_i = \{f_{q_1} = v_{{q_1}_i}, f_{q_1} = v_{{q_1}_j}, ... f_{q_i} = v_{{q_i}_i}\}$. In …

Topic: target-encoding categorical-encoding machine-learning

Category: Data Science

NaN in Keras neural network results

Ayan Mitra

May 24, 2022 03:40

I am creating a neural network with a simple architecture, but I keep getting NaN in the results and can't figure out why. Below is my code. import pandas from keras.models import Sequential from keras.layers import Dense from keras.wrappers.scikit_learn import KerasClassifier from keras.utils import np_utils from sklearn.model_selection import cross_val_score from sklearn.model_selection import KFold from sklearn.preprocessing import LabelEncoder from sklearn.pipeline import Pipeline from collections import Counter from sklearn.metrics import classification_report, confusion_matrix from sklearn.preprocessing import StandardScaler #from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder from tensorflow.keras …

Topic: categorical-encoding keras logistic-regression neural-network python

Category: Data Science

Cat2Vec implementation X = categorical and y = categorical

Snader

May 18, 2022 15:23

I am trying to convert categorical values (zipcodes) with Cat2Vec into a matrix which can be used as an input shape for categorical prediction of a target with binary values. After reading several articles, among which: https://www.yanxishe.com/TextTranslation/1656?from=csdn I am having trouble understanding two things: 1) With respect to which y in Cat2Vec encoding are you creating embeddings? Is it with respect to the actual target in the dataset you are trying to predict, or can you randomly choose any …

Topic: categorical-encoding word2vec deep-learning

Category: Data Science

Does it make sense to use target encoding together with tree-based models?

KJA

May 17, 2022 13:28

I'm working on a regression problem with a few high-cardinality categorical features (forecasting different items with a single model). Someone suggested using target encoding (mean/median of the target of each item) together with xgboost. While I understand how this new feature would improve a linear model (or GMMs in general), I do not understand how this approach would fit into a tree-based model (Regression Trees, Random Forest, Boosting). Given the feature is used for splitting, items with a mean below …

Topic: target-encoding categorical-encoding xgboost random-forest

Category: Data Science

How to deal with address (like zip-code) for training a model?

aRedDish

May 4, 2022 20:11

To me it doesn't make sense to normalize it, even if it is a numerical variable like a zip code. Should an address be interpreted as a categorical feature, like "neighborhood"? Suppose I have geolocation data (latitude and longitude): the best thing to do seems to be to use k-means clustering and then work with the cluster labels, which I "encode". If the answer is "it depends", please tell me how.

Topic: categorical-encoding geospatial machine-learning

Category: Data Science

Encode each comma separated value in Pandas

spd

May 1, 2022 04:14

I have a dataset:

Inp1    Inp2      Inp3              Output
A,B,C   AI,UI,JI  Apple,Bat,Dog     Animals
L,M,N   LI,DO,LI  Lawn, Moon, Noon  Noun
X,Y     AI,UI     Yemen,Zombie      Extras

I need to apply an ML algorithm to these values, so I need an encoding technique. I tried label encoding, but it encodes the entire cell to a single int, e.g.:

Inp1  Inp2  Inp3  Output
5     4     8     0

But I need a separate encoding for each value in a cell. How should I go about it?

Inp1 Inp2 …

Topic: categorical-encoding one-hot-encoding python-3.x pandas categorical-data

Category: Data Science
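One possible reading of "a separate encoding for each value in a cell" in the question above is a multi-hot encoding: one 0/1 column per distinct value. The sketch below assumes that reading and uses only the Inp1 and Inp2 columns from the question; the `Inp1_`/`Inp2_` prefixes are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Inp2": ["AI,UI,JI", "LI,DO,LI", "AI,UI"],
})

# Split each cell on commas and create one 0/1 indicator column per distinct value.
inp1 = df["Inp1"].str.get_dummies(sep=",").add_prefix("Inp1_")
inp2 = df["Inp2"].str.get_dummies(sep=",").add_prefix("Inp2_")

encoded = pd.concat([inp1, inp2], axis=1)
print(encoded)
```

Note that a value repeated inside one cell (such as LI in the second row) collapses to a single indicator, and cells containing spaces after the commas, as in Inp3, would need to be stripped first.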

How to create the categorical mask for images specifically for Tensor? Or port the NumPy function correctly to Dataset.map function

Maifee Ul Asad

April 25, 2022 10:20

I'm trying to move from NumPy array as my dataset to tensorflow.Dataset. Now, I've created a pipeline to train the model for classification problems. At some point, I just normalize all the images using map function: dataset['train'] = dataset['train'].map(pre_pr, num_parallel_calls=tf.data.experimental.AUTOTUNE) And the function description looks like this: @tf.function def normalize(input_image: tf.Tensor, input_mask: tf.Tensor) -> tuple: input_image = tf.cast(input_image, tf.float32) / 255.0 input_mask= tf.cast(input_mask, tf.float32) / 255.0 return input_image, input_mask @tf.function def pre_pr(datapoint: dict) -> tuple: input_image = tf.image.resize(datapoint['image'], (IMG_SIZE, IMG_SIZE)) …

Topic: semantic-segmentation categorical-encoding keras tensorflow

Category: Data Science

How do I get the mean values that are greater than .5 for my model?

bulldog23

April 14, 2022 09:38

I am trying to build a classification model. One of the variables called specialty has 200 values. Based on a previous post I saw, I decided I wanted to include the values that have the highest mean. I am thinking greater than 0.5. How would I filter the specialty to have only values greater than 0.5 for the mean? I am trying to get my final dataset ready for machine learning. Any advice is appreciated.

Topic: categorical-encoding logistic-regression classification categorical-data

Category: Data Science
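For the specialty question above, a minimal sketch of the filtering step could look like the following. It assumes "mean" refers to the mean of a binary target per specialty; the column name `target` and the toy values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "specialty": ["cardio", "cardio", "derm", "derm", "neuro", "neuro"],
    "target":    [1, 1, 0, 1, 0, 0],
})

# Mean of the target for each specialty value.
means = df.groupby("specialty")["target"].mean()

# Keep only the specialties whose mean is strictly greater than 0.5.
keep = means[means > 0.5].index
filtered = df[df["specialty"].isin(keep)]
print(filtered)
```

If the selection is used to build model features, computing the means on the training split only avoids leaking target information from the evaluation data.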

What to do if a specific label of a category appears only a few times?

March 31, 2022 03:01

Let's say I am trying to predict whether a car will be auctioned or not (not what I'm actually trying to do, but it represents it pretty well) using tabular data. I have the year the car was made, its color, model, etc. The model is the name of a car (e.g., Sportage, Mazda3, etc.), and some of the more famous models such as Sportage appear many times whereas some of the less popular ones might appear only once or twice. …

Topic: categorical-encoding data classification dataset categorical-data

Category: Data Science
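One option sometimes used for the rare-label situation described above is to merge infrequent labels into a single bucket before encoding. This is only a sketch of that idea, not necessarily the right answer for the asker's data: the threshold `min_count`, the bucket name "Other" and the label "Niva" are hypothetical.

```python
import pandas as pd

models = pd.Series(["Sportage", "Sportage", "Sportage", "Mazda3", "Mazda3", "Niva"])

min_count = 2  # hypothetical threshold: labels seen fewer times are merged

# Count how often each label appears and collect the rare ones.
counts = models.value_counts()
rare = counts[counts < min_count].index

# Replace rare labels with a shared "Other" bucket, keep the rest unchanged.
grouped = models.where(~models.isin(rare), "Other")
print(grouped.tolist())
```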

Target encoding with KFold cross-validation - how to transform test set?

Xaume

March 27, 2022 08:28

Let's say I have a categorical feature (cat): import random import pandas as pd from sklearn.model_selection import train_test_split, StratifiedKFold random.seed(1234) y = random.choices([1, 0], weights=[0.2, 0.8], k=100) cat = random.choices(["A", "B", "C"], k=100) df = pd.DataFrame.from_dict({"y": y, "cat": cat}) and I want to use target encoding with regularisation using CV like below: X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42) df_train = pd.concat([X_train, y_train], axis=1).sort_index() df_train["kfold"] = -1 idx = df_train.index df_train = df_train.sample(frac=1) skf = StratifiedKFold(n_splits=5) for fold_id, …

Topic: target-encoding categorical-encoding scikit-learn statistics machine-learning

Category: Data Science
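A common convention for the test-set question above is to encode training rows out-of-fold and to encode test rows with category means from the full training set. The sketch below illustrates that convention under simplifying assumptions: it uses plain `KFold` rather than the question's `StratifiedKFold`, a small made-up dataset, and the global mean as the fallback for unseen categories.

```python
import pandas as pd
from sklearn.model_selection import KFold

train = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "y":   [1, 0, 1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({"cat": ["A", "C", "D"]})  # "D" never appears in train

global_mean = train["y"].mean()

# Out-of-fold encoding for the training rows: each row is encoded with
# category means computed on the other folds only.
train["cat_te"] = global_mean
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    fold_means = train.iloc[fit_idx].groupby("cat")["y"].mean()
    encoded = train.iloc[enc_idx]["cat"].map(fold_means).fillna(global_mean)
    train.loc[train.index[enc_idx], "cat_te"] = encoded.to_numpy()

# Test rows: use means from the whole training set; unseen categories
# fall back to the global mean.
full_means = train.groupby("cat")["y"].mean()
test["cat_te"] = test["cat"].map(full_means).fillna(global_mean)
print(test)
```

The out-of-fold step only matters for the training rows, where a row's own target would otherwise leak into its encoding; test rows never contributed to the means, so the full training set can be used for them.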

Handling date and time fields for classification task

insomniac

March 17, 2022 10:58

I'm working on a classification task (the dataset is 400,000 rows and 30 columns) and one of my features was date-time. I've extracted the month, day of the week, and hour from the dataset (year is a single value and I don't think minutes will have much influence). Since they're now categorical variables, how do I deal with them? Should I leave them as a single row or use one-hot encoding or go for target encoding?

Topic: categorical-encoding time classification categorical-data

Category: Data Science

How to handle categorical variables with Random Forest using Scikit Learn?

Fred Chang

March 14, 2022 21:09

One of the variables/features is the department id, which is like 1001, 1002, ..., 1218, etc. The ids are nominal, not ordinal, i.e., they are just ids; department 1002 is by no means higher than department 1001. I feed the feature to a random forest using Scikit Learn. How should I deal with it? Some people say to use one-hot encoding. However, others say that one-hot encoding degrades random forest's performance. Also, I do have over 200 departments, so I …

Topic: categorical-encoding one-hot-encoding random-forest

Category: Data Science

One-hot & interaction one-hot on multiple categorical

Artur Motruk

March 14, 2022 13:04

I was wondering if there is any value to creating combined features out of multiple categorical variables when the individual categorical variables are already one-hot encoded? Simple example: there is a variable P with categories {X, Y} and a variable Q with categories {Z, W}. After one-hot, we would have 4 variables: P.X, P.Y, Q.Z, and Q.W. In this scenario, I'm wondering if the algorithm (Xgboost or a deep neural network) would sufficiently learn interaction effects between these or is …

Topic: categorical-encoding one-hot-encoding feature-engineering xgboost neural-network

Category: Data Science

Categorical feature encoding

Rose

March 4, 2022 06:04

I am making a classification model. I have categorical and continuous data. The categorical columns include columns with 2 classes, such as sex (male, female), and multi-class columns such as location. I need to encode these to numeric values. I would do one-hot encoding and drop the first column, but that is not realistic for unseen test data that may contain unseen values, so I have planned to do one-hot encoding with handle_unknown='ignore'. However, my problem is that I am afraid …

Topic: categorical-encoding one-hot-encoding encoding classification machine-learning

Category: Data Science
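To illustrate what handle_unknown='ignore' from the question above does with a value that was not seen during fitting, here is a small sketch using scikit-learn's `OneHotEncoder`. The location values are made up; the excerpt does not show the asker's actual data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"location": ["north", "south", "east"]})
test = pd.DataFrame({"location": ["south", "west"]})  # "west" is unseen

# With handle_unknown="ignore", an unseen category is encoded as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

test_encoded = enc.transform(test).toarray()
print(enc.categories_)
print(test_encoded)
```

The learned categories are sorted (east, north, south), so the "south" row gets a 1 in the last column and the "west" row is all zeros.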

Categorical Variable Embedding

Agile

February 28, 2022 05:18

I have a categorical variable in my labeled dataset. I trained a one-hot encoded version of it in another neural network that has an embedding layer, using a larger labeled dataset. I have obtained the weights of the embedding layer. Is it possible to use the embedding layer weights as a representation of the categorical variable, like one-hot encoding, in another network which has no embedding layer? For example, One-hot-encoded variable, A B C D D 0 0 0 1 B 0 1 0 0 …

Topic: categorical-encoding embeddings

Category: Data Science

How to do target encoding when data has repeated rows?

pierround

February 25, 2022 03:04

How can I encode a category when the data has repeated rows? Can I do target encoding, or is there another encoding I can use? I want to figure out how to include a categorical variable in a model to predict a numerical variable Y. Because I am working with some legislative data, my challenge is that my category code has over 4000 unique values, which cannot be easily grouped(*) and can have repeats. In fact, anecdotally I …

Topic: categorical-encoding feature-engineering encoding

Category: Data Science

What is the best practice to normalize/standardize imbalanced data for outlier detection or binary classification task?

Mario

February 21, 2022 19:03

I'm researching anomaly/outlier/fraud detection, and I'm looking for the best practice to pre-process synthetic, imbalanced data. I have checked the normalization/standardization methods that are not sensitive to the presence of outliers and fit this case study. The scikit-learn 0.24.2 example "Compare the effect of different scalers on data with outliers" states: If some outliers are present in the set, robust scalers or transformers are more appropriate. I'm using CTU-13 dataset, …

Topic: binary-classification categorical-encoding imbalanced-data normalization anomaly-detection

Category: Data Science

Should one-hot encoded categorical features be scaled when used along with a text feature while deriving semantic similarity?

Bruso

February 21, 2022 05:04

My aim is to derive textual similarity using multiple features. Some of the features are textual for which I am using (Tfhub 2.0) Universal Sentence encoder. There are other categorical features which are encoded using one-hot encoder. For example, for a single record in my dataset, feature vector looks like this: text feature's embedding is 512 dimension vector - 1 X 512 categorical (non-ordered) feature vector - 1 X 500 (since there are 500 unique values in the feature) my …

Topic: categorical-encoding semantic-similarity feature-scaling

Category: Data Science
