Turning multiple binary columns into categorical columns (with fewer columns) with Python Pandas

I want to turn these categories into values of categorical columns. The values in each category are the current binary columns present in the data frame. A11, A12, ... are details of A1, so A11 == 1 necessarily implies A1 == 1, but the inverse is not valid. The following conditions must be respected: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be 'A11' and we ignore 'A1' …
Category: Data Science
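
The excerpt above is cut off, so the full rule set is unknown. The following is only a minimal sketch of the part that is stated: a detail column such as A11 takes precedence over its parent A1, and each row gets at most four type columns. The sample data, the `parent_of` mapping, the extra column `B1`, and the `type1`..`type4` output names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: A11 and A12 are details of A1; B1 has no detail columns.
df = pd.DataFrame({
    "A1":  [1, 1, 0],
    "A11": [1, 0, 0],
    "A12": [0, 0, 0],
    "B1":  [1, 0, 1],
})

# Assumed mapping from each detail column to its parent column.
parent_of = {"A11": "A1", "A12": "A1"}
MAX_TYPES = 4  # the question states a maximum of four types


def row_types(row: pd.Series) -> pd.Series:
    # Columns whose binary flag is set for this row.
    active = [col for col in row.index if row[col] == 1]
    # A parent is ignored when one of its detail columns is also active.
    covered = {parent_of[col] for col in active if col in parent_of}
    kept = [col for col in active if col not in covered][:MAX_TYPES]
    # Pad with None so every row yields exactly MAX_TYPES values.
    kept += [None] * (MAX_TYPES - len(kept))
    return pd.Series(kept, index=[f"type{i + 1}" for i in range(MAX_TYPES)])


types = df.apply(row_types, axis=1)
print(types)
```

Rows where only the parent is set keep the parent (`A1`), which is one possible reading of "the inverse is not valid".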

Handling encoding of a dataset which has more than 2000 columns in total

Whenever a dataset has to be preprocessed before being fed to a model, we convert the categorical values to numerical values, generally using techniques such as label encoding or one-hot encoding, but these are all applied manually by going through each column. But what if our dataset is huge in terms of columns (e.g., 2000 columns)? It won't be possible to go through each column manually, so how do we handle encoding in such cases? Are there any …
Category: Data Science
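
One way to avoid listing columns by hand, sketched under the assumption that every non-numeric column should be one-hot encoded, is to select the columns by dtype. The tiny frame below is a hypothetical stand-in for the wide dataset.

```python
import pandas as pd

# Hypothetical frame standing in for a dataset with thousands of columns.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
    "price": [1.0, 2.5, 3.0],
})

# Select every non-numeric column by dtype instead of naming each one.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode all selected columns in a single call.
encoded = pd.get_dummies(df, columns=cat_cols)
print(encoded.columns.tolist())
```

With many high-cardinality columns, one-hot encoding can produce a very wide result, so a different encoder may be a better fit in that case.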

Dealing with observations with an arbitrary number of categories with an arbitrary number of values

Suppose we have a set of elements $X = \{x_1, x_2, ..., x_n\}$. Each element is characterised by a set of features. The features characterising a particular element $x_i$ can belong to one of $q$ different categories. Each different category $f_q$ can have a different value $v_{q_i}$, belonging to a set of possible values $V_q = \{v_{q_1}, v_{q_2} ...\}$. So, an observation $x_i$ may be described as $x_i = \{f_{q_1} = v_{{q_1}_i}, f_{q_1} = v_{{q_1}_j}, ... f_{q_i} = v_{{q_i}_i}\}$. In …
Category: Data Science

NaN in Keras neural network results

I am creating a simple neural network architecture, but I keep getting NaN in the results and can't figure out why. Below is my code:

    import pandas
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier
    from keras.utils import np_utils
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold
    from sklearn.preprocessing import LabelEncoder
    from sklearn.pipeline import Pipeline
    from collections import Counter
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.preprocessing import StandardScaler
    #from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from tensorflow.keras …
Category: Data Science

Cat2Vec implementation X = categorical and y = categorical

I am trying to convert categorical values (zip codes) with Cat2Vec into a matrix which can be used as an input shape for categorical prediction of a target with binary values. After reading several articles, among them https://www.yanxishe.com/TextTranslation/1656?from=csdn, I am having trouble understanding two things: 1) With respect to which y are you creating embeddings in Cat2Vec encoding? Is it with respect to the actual target in the dataset you are trying to predict, or can you randomly choose any …
Category: Data Science

Does it make sense to use target encoding together with tree-based models?

I'm working on a regression problem with a few high-cardinality categorical features (forecasting different items with a single model). Someone suggested using target encoding (mean/median of the target of each item) together with xgboost. While I understand how this new feature would improve a linear model (or GMMs in general), I do not understand how this approach would fit into a tree-based model (regression trees, random forest, boosting). Given the feature is used for splitting, items with a mean below …
Category: Data Science

How to deal with address (like zip-code) for training a model?

To me, it doesn't make sense to normalize it, even if it is a numerical variable like a zip code. Shouldn't an address be interpreted as a categorical feature, like "neighborhood"? Suppose I have geolocation data (latitude and longitude): the best thing to do seems to be k-means clustering and then working with the cluster labels, which I "encode". If the answer is "it depends", please tell me how.
Category: Data Science
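
The clustering idea from the question can be sketched as follows. The coordinates are made-up values, `n_clusters=2` is an arbitrary choice, and applying k-means directly to latitude and longitude treats them as flat Euclidean coordinates, which is only a rough approximation over small areas.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical coordinates forming two well-separated groups.
coords = pd.DataFrame({
    "lat": [48.85, 48.86, 45.75, 45.76],
    "lon": [2.35, 2.34, 4.85, 4.84],
})

# Cluster the raw coordinates; the number of clusters is an assumption.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
coords["cluster"] = kmeans.fit_predict(coords[["lat", "lon"]])

# Treat the cluster id as a nominal category and one-hot encode it.
features = pd.get_dummies(coords["cluster"], prefix="cluster")
print(features)
```

The cluster id is a label, not a quantity, which is why it is one-hot encoded here rather than scaled.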

Encode each comma separated value in Pandas

I have a dataset:

Inp1    Inp2      Inp3              Output
A,B,C   AI,UI,JI  Apple,Bat,Dog     Animals
L,M,N   LI,DO,LI  Lawn, Moon, Noon  Noun
X,Y     AI,UI     Yemen,Zombie      Extras

I need to apply an ML algorithm to these values, so I need an encoding technique. I tried label encoding, but it encodes the entire cell as a single integer, e.g.:

Inp1  Inp2  Inp3  Output
5     4     8     0

But I need a separate encoding for each value in a cell. How should I go about it? Inp1 Inp2 …
Category: Data Science
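
The desired output format is cut off in the excerpt, so the following is only one possible reading: a multi-hot encoding in which each comma-separated value gets its own 0/1 column. It uses the `Inp1` column from the question; the `Inp1_` prefix is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Inp1": ["A,B,C", "L,M,N", "X,Y"],
    "Output": ["Animals", "Noun", "Extras"],
})

# Split each cell on commas and create one 0/1 column per distinct value.
inp1_encoded = df["Inp1"].str.get_dummies(sep=",").add_prefix("Inp1_")
print(inp1_encoded)
```

Cells that contain spaces after the commas, such as "Lawn, Moon, Noon", would need the whitespace stripped first so that " Moon" and "Moon" are not treated as different values.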

How to create the categorical mask for images specifically for Tensor? Or port the NumPy function correctly to Dataset.map function

I'm trying to move from a NumPy array as my dataset to tensorflow.Dataset. Now, I've created a pipeline to train the model for classification problems. At some point, I just normalize all the images using the map function:

    dataset['train'] = dataset['train'].map(pre_pr, num_parallel_calls=tf.data.experimental.AUTOTUNE)

And the function description looks like this:

    @tf.function
    def normalize(input_image: tf.Tensor, input_mask: tf.Tensor) -> tuple:
        input_image = tf.cast(input_image, tf.float32) / 255.0
        input_mask = tf.cast(input_mask, tf.float32) / 255.0
        return input_image, input_mask

    @tf.function
    def pre_pr(datapoint: dict) -> tuple:
        input_image = tf.image.resize(datapoint['image'], (IMG_SIZE, IMG_SIZE))
        …
Category: Data Science

How do I get the mean values that are greater than .5 for my model?

I am trying to build a classification model. One of the variables called specialty has 200 values. Based on a previous post I saw, I decided I wanted to include the values that have the highest mean. I am thinking greater than 0.5. How would I filter the specialty to have only values greater than 0.5 for the mean? I am trying to get my final dataset ready for machine learning. Any advice is appreciated.
Category: Data Science
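
A minimal sketch of the filter described above, assuming the mean in question is the mean of a binary outcome per specialty. The column name `target` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical data: `target` is an assumed binary outcome column.
df = pd.DataFrame({
    "specialty": ["cardio", "cardio", "derm", "derm", "neuro", "neuro"],
    "target":    [1, 1, 0, 1, 0, 0],
})

# Mean of the target for each specialty.
mean_by_specialty = df.groupby("specialty")["target"].mean()

# Keep only specialties whose mean is strictly greater than 0.5.
keep = mean_by_specialty[mean_by_specialty > 0.5].index
filtered = df[df["specialty"].isin(keep)]
print(filtered)
```

Here `derm` has a mean of exactly 0.5 and is dropped because the comparison is strict; use `>=` to keep it.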

What to do if a specific label of a category appears only a few times?

Let's say I am trying to predict whether a car will be auctioned or not (not what I'm actually trying to do, but it represents it pretty well) using tabular data. I have the year the car was made, its color, model, etc. The model is the name of a car (e.g., Sportage, Mazda3, etc.), and some of the more famous models such as Sportage appear many times, whereas some of the less popular ones might appear only once or twice. …
Category: Data Science
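
One common option for rare labels, shown here only as a sketch, is to merge them into a shared bucket. The threshold `MIN_COUNT`, the bucket name "Other", and the labels `RareModelA` and `RareModelB` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column: two frequent models and two that appear once.
models = pd.Series(
    ["Sportage", "Sportage", "Sportage", "Mazda3", "Mazda3",
     "RareModelA", "RareModelB"],
    name="model",
)

MIN_COUNT = 2  # assumed threshold; labels seen fewer times are merged

counts = models.value_counts()
rare = counts[counts < MIN_COUNT].index

# Replace every rare label with a shared "Other" bucket.
grouped = models.where(~models.isin(rare), "Other")
print(grouped.value_counts())
```

The set of rare labels should be computed on the training data only and then applied unchanged to validation and test data.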

Target encoding with KFold cross-validation - how to transform test set?

Let's say I have a categorical feature (cat):

    import random
    import pandas as pd
    from sklearn.model_selection import train_test_split, StratifiedKFold

    random.seed(1234)
    y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
    cat = random.choices(["A", "B", "C"], k=100)
    df = pd.DataFrame.from_dict({"y": y, "cat": cat})

and I want to use target encoding with regularisation using CV like below:

    X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
    df_train = pd.concat([X_train, y_train], axis=1).sort_index()
    df_train["kfold"] = -1
    idx = df_train.index
    df_train = df_train.sample(frac=1)
    skf = StratifiedKFold(n_splits=5)
    for fold_id, …
Category: Data Science
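
The question's code is cut off, so this is not a completion of it. It is a separate sketch of one common convention: training rows get out-of-fold means, while test rows get category means computed on the whole training set, with the global mean as a fallback for unseen categories. The small frames and the column name `cat_te` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "A", "B", "C", "C"],
    "y":   [1, 0, 0, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"cat": ["A", "B", "D"]})  # "D" is unseen in training

global_mean = train["y"].mean()

# Out-of-fold encoding: each training row is encoded with means computed
# from the other folds only, so its own target never leaks into its feature.
train["cat_te"] = global_mean
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in skf.split(train[["cat"]], train["y"]):
    fold_means = train.iloc[fit_idx].groupby("cat")["y"].mean()
    encoded = train.iloc[enc_idx]["cat"].map(fold_means).fillna(global_mean)
    train.loc[train.index[enc_idx], "cat_te"] = encoded.to_numpy()

# Test rows: category means from the full training set, with the global
# mean as the fallback for categories never seen in training.
full_means = train.groupby("cat")["y"].mean()
test["cat_te"] = test["cat"].map(full_means).fillna(global_mean)
print(test)
```

Averaging the per-fold means for the test set is another convention; which one works better is an empirical question.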

Handling date and time fields for classification task

I'm working on a classification task (the dataset has 400,000 rows and 30 columns), and one of my features was a date-time. I've extracted the month, day of the week, and hour from the dataset (year is a single value and I don't think minutes will have much influence). Since they're now categorical variables, how do I deal with them? Should I leave each of them as a single column, use one-hot encoding, or go for target encoding?
Category: Data Science
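
A short sketch of the extraction and of the one-hot option mentioned in the question. The sine/cosine encoding of the hour is an extra alternative that is not part of the question, and the column name `timestamp` and the sample dates are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-04 08:00", "2021-06-12 23:00", "2021-12-31 00:00"]
    )
})

# Extract the parts mentioned in the question.
df["month"] = df["timestamp"].dt.month
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["hour"] = df["timestamp"].dt.hour

# Option from the question: one-hot encode the extracted parts.
one_hot = pd.get_dummies(df[["month", "dayofweek"]], columns=["month", "dayofweek"])

# Alternative (an assumption, not from the question): cyclical encoding,
# which places hour 23 next to hour 0.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(one_hot.columns.tolist())
```

Tree-based models can often use the raw integer parts directly, while linear models and neural networks usually benefit from one of the encodings above.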

How to handle categorical variables with Random Forest using Scikit Learn?

One of the variables/features is the department id, which is like 1001, 1002, ..., 1218, etc. The ids are nominal, not ordinal, i.e., they are just ids; department 1002 is by no means higher than department 1001. I feed the feature to a random forest using Scikit Learn. How should I deal with it? Some people say to use one-hot encoding. However, others say that one-hot encoding degrades random forest's performance. Also, I do have over 200 departments, so I …
Category: Data Science

One-hot & interaction one-hot on multiple categorical

I was wondering if there is any value to creating combined features out of multiple categorical variables when the individual categorical variables are already one-hot encoded? Simple example: there is a variable P with categories {X, Y} and a variable Q with categories {Z, W}. After one-hot, we would have 4 variables: P.X, P.Y, Q.Z, and Q.W. In this scenario, I'm wondering if the algorithm (Xgboost or a deep neural network) would sufficiently learn interaction effects between these or is …
Category: Data Science
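
To make the two feature sets in the example concrete, here is a small sketch that builds both the individual one-hot columns and an explicit combined feature for P and Q. The combined column name `P_Q` and the `PQ` prefix are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"P": ["X", "Y", "X", "Y"], "Q": ["Z", "Z", "W", "W"]})

# Individual one-hot columns: P_X, P_Y, Q_W, Q_Z.
main_effects = pd.get_dummies(df, columns=["P", "Q"])

# Explicit interaction feature: one category per (P, Q) combination.
df["P_Q"] = df["P"] + "_" + df["Q"]
interactions = pd.get_dummies(df["P_Q"], prefix="PQ")
print(interactions.columns.tolist())
```

Whether the explicit interaction columns help is an empirical question; the sketch only shows how to build them so the two variants can be compared.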

Categorical feature encoding

I am making a classification model. I have categorical and continuous data. The categorical columns include columns with 2 classes such as sex (male, female), and multi-class columns such as location. I need to encode these to numeric values. I would do one-hot encoding and drop the first column, but that is not realistic on unseen test data that may have unseen values. So I have planned to do one-hot encoding with handle_unknown='ignore'. However, my problem is that I am afraid …
Category: Data Science
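
A small sketch of what `handle_unknown='ignore'` does in scikit-learn's `OneHotEncoder`: a category not seen during fitting is encoded as a row of zeros. The `location` values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"location": ["north", "south", "east"]})
test = pd.DataFrame({"location": ["south", "west"]})  # "west" is unseen

# Fit on the training data only, ignoring unknown categories at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert it for display.
encoded_test = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(encoded_test)
```

The learned categories are sorted (`east`, `north`, `south`), so "south" maps to the last column and the unseen "west" becomes an all-zero row.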

Categorical Variable Embedding

I have a categorical variable in my labeled dataset. I trained a one-hot encoded version of it in another neural network that has an embedding layer, using a larger labeled dataset. I have obtained the weights of the embedding layer. Is it possible to use the embedding layer weights as a categorical variable representation, like one-hot encoding, in another network which has no embedding layer? For example, the one-hot-encoded variable:

   A B C D
D  0 0 0 1
B  0 1 0 0
…
Category: Data Science
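
The relationship between the two representations can be sketched with NumPy: multiplying a one-hot row by the embedding weight matrix selects the corresponding row of that matrix, so the weights can serve as a fixed lookup table. The random matrix below is a stand-in for trained weights, and the 3-dimensional embedding size is an arbitrary assumption.

```python
import numpy as np

categories = ["A", "B", "C", "D"]
index_of = {c: i for i, c in enumerate(categories)}

# Stand-in for trained embedding weights: one row per category.
rng = np.random.default_rng(0)
embedding_weights = rng.normal(size=(len(categories), 3))

values = ["D", "B"]
indices = [index_of[v] for v in values]

# One-hot rows for D and B, matching the example table.
one_hot = np.eye(len(categories))[indices]

# Multiplying a one-hot row by the weights picks out one row of the matrix,
# which is the same as a direct lookup by index.
via_matmul = one_hot @ embedding_weights
via_lookup = embedding_weights[indices]
print(np.allclose(via_matmul, via_lookup))
```

In the second network, the looked-up vectors would simply be fed in as dense numeric features in place of the one-hot columns.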

How to do target encoding when data has repeated rows?

How can I do encoding for a category when data has repeated rows? Can I do target encoding? Or is there another encoding I can use? I want to figure out how to include a categorical variable in a model to predict a numerical variable Y. Because I am working with some legislative data, my challenge is that my category code has over 4000 unique values, which cannot be easily grouped (*), and they can have repeats. In fact, anecdotally I …
Category: Data Science

What is the best practice to normalize/standardize imbalanced data for outlier detection or binary classification task?

I'm researching anomaly/outlier/fraud detection, and I'm looking for the best practice to pre-process synthetic, imbalanced data. I have checked all the normalizing/standardizing methodologies that are not sensitive to the presence of outliers and fit this case study. The scikit-learn 0.24.2 example "Compare the effect of different scalers on data with outliers" states: "If some outliers are present in the set, robust scalers or transformers are more appropriate." I'm using CTU-13 dataset, …
Category: Data Science
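
The quoted statement can be illustrated with a small comparison of scikit-learn's `StandardScaler` and `RobustScaler` on made-up data containing a single extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: four ordinary values and one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# StandardScaler uses the mean and standard deviation, both of which are
# dominated by the outlier, so the ordinary values end up almost identical.
standard = StandardScaler().fit_transform(x)

# RobustScaler uses the median and interquartile range by default, so the
# ordinary values keep a usable spread.
robust = RobustScaler().fit_transform(x)

print(standard.ravel())
print(robust.ravel())
```

Note that the outlier itself is still extreme after robust scaling; the scaler only prevents it from distorting the scale of the remaining points.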

Should one-hot encoded categorical features be scaled when used along with text features while deriving semantic similarity?

My aim is to derive textual similarity using multiple features. Some of the features are textual, for which I am using the (Tfhub 2.0) Universal Sentence Encoder. There are other categorical features which are encoded using a one-hot encoder. For example, for a single record in my dataset, the feature vector looks like this:
- text feature's embedding is a 512-dimensional vector: 1 x 512
- categorical (non-ordered) feature vector: 1 x 500 (since there are 500 unique values in the feature)
my …
Category: Data Science
