How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income, B: up to $500, C: $500-$700, …, P: $5000-$6000, Q: more than $6000. It seems odd to me that I have to create dummies for this variable, since it's ordered. I wonder if it's better to map the values {'A': 0, 'B': 1, …, 'Q': 17} so I can feed them into the algorithm as integers. What's the proper way of preprocessing …
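
One common way to keep the ordering is an explicit rank mapping instead of dummies; a minimal sketch with pandas, where the column name income_band and the shortened category list are illustrative assumptions, not from the question:

    import pandas as pd

    # Hypothetical column name; in practice list every income band in its natural order.
    income_order = ['A', 'B', 'C', 'P', 'Q']                # elided bands omitted
    rank = {label: i for i, label in enumerate(income_order)}

    df = pd.DataFrame({'income_band': ['B', 'Q', 'A', 'P']})
    df['income_rank'] = df['income_band'].map(rank)         # ordered integers instead of dummies
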
Category: Data Science

How to deal with name strings in large data sets for ML?

My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later. Word embedding techniques are mostly meant for longer text sequences, not for single-word strings as in this case, so I don't think they would work here. Additionally, label encoding or label binarization may not be suitable ways to work with names, because of the many different values on the one side …
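
If the name columns are kept as features at all, one simple alternative to embeddings and one-hot encoding is frequency encoding; a minimal sketch with pandas (the column name is hypothetical):

    import pandas as pd

    df = pd.DataFrame({'first_name': ['anna', 'john', 'anna', 'li']})

    # Replace each name by its relative frequency; rare names get small values.
    freq = df['first_name'].value_counts(normalize=True)
    df['first_name_freq'] = df['first_name'].map(freq)
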
Category: Data Science

Is test data required to be transformed by training data statistics?

I am using a dataset (from the literature) to build an MLP and classify real-world samples (from a wetlab experiment) using this MLP. The performance of the MLP on the literature dataset is good enough. I am following the standard preprocessing procedure where, after splitting, I first standardize my training data with fit_transform and then the testing data with transform, so that I use only the training data statistics (mean and std) to standardize unseen data. However, when …
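
For reference, that procedure looks roughly like this in scikit-learn (the placeholder data is only there to make the snippet runnable):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)   # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)   # learn mean/std on the training split only
    X_test_std = scaler.transform(X_test)         # apply those same statistics to unseen data
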
Category: Data Science

Preprocessing in TensorFlow

Good evening, I am working on a paper comparing Python libraries for machine learning and deep learning. Trying to evaluate Keras and TensorFlow separately, I'm looking for information about TensorFlow methods or functions that can be used to preprocess datasets, like those included in scikit-learn (sklearn.preprocessing) or the Keras preprocessing layers, but I can't find anything beyond one-hot encoding for labels... Does anyone know if what I am looking for exists? Thank you very much!
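
Beyond tf.one_hot for labels, basic preprocessing such as standardization can be written with plain TensorFlow ops; a minimal sketch (the feature matrix is made up for illustration):

    import tensorflow as tf

    X = tf.constant([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    # Column-wise standardization using only TensorFlow ops.
    mean = tf.math.reduce_mean(X, axis=0)
    std = tf.math.reduce_std(X, axis=0)
    X_std = (X - mean) / std

    labels_onehot = tf.one_hot(tf.constant([0, 2, 1]), depth=3)   # label encoding
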
Category: Data Science

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing, but I am not sure whether I should include data cleaning, data extraction and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …
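
One way to fold a dataset-specific cleaning rule into the same Pipeline is a FunctionTransformer (or a custom TransformerMixin); a minimal sketch, where clip_negatives is a made-up example of such a rule:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler
    from sklearn.linear_model import LogisticRegression

    def clip_negatives(X):
        # Hypothetical dataset-specific cleaning rule: negative readings are sensor errors.
        return np.clip(X, 0, None)

    pipe = Pipeline([
        ('clean', FunctionTransformer(clip_negatives)),
        ('scale', StandardScaler()),
        ('model', LogisticRegression()),
    ])

    X, y = np.random.randn(50, 4), np.random.randint(0, 2, 50)
    pipe.fit(X, y)
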
Category: Data Science

What is the suggested way to create features (Mel-Spectrograms) from a speech signal for classification with ResNet?

At the moment I have this piece of code, which cuts a spectrogram into fixed-length tensors:

    def chunks(l, n):
        """Yield successive n-sized chunks from l."""
        for i in range(0, len(l[0][0]), n):
            if i + n < len(l[0][0]):
                yield l.narrow(2, i, n)

The following piece of code downsamples the audio, creates mel-spectrograms and takes their log, applies cepstral mean and variance normalization, and then cuts the spectrogram with the code above into fixed-length chunks and appends them …
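
A common route is torchaudio's built-in transforms; a minimal sketch, with an illustrative file name and parameter values (not taken from the question):

    import torch
    import torchaudio

    waveform, sr = torchaudio.load('speech.wav')                     # hypothetical file
    waveform = torchaudio.functional.resample(waveform, sr, 16000)   # downsample

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)(waveform)
    log_mel = torch.log(mel + 1e-6)

    # Cut the (channel, n_mels, frames) tensor into fixed-length chunks along time.
    chunks = [c for c in torch.split(log_mel, 128, dim=2) if c.shape[2] == 128]
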
Category: Data Science

Pre-process image data before training a OneClassSVM and decrease the number of features

I want to train a OneClassSVM() using sklearn, and I have a set of around 800 images in my training set. I am using OpenCV to read the images and resize them to constant dimensions (960x540), then adding them to a NumPy array. The images are RGB and thus have three dimensions. For that, I am reshaping the NumPy array after reading all the images: #Assume X is my numpy array which contains all the images before reshaping #Now I reshape …
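
To cut down the number of features after flattening, one option is PCA before the OneClassSVM; a minimal sketch (a small random stand-in replaces the real (800, 540, 960, 3) array so the snippet stays light):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import OneClassSVM

    X = np.random.rand(100, 54, 96, 3)            # stand-in for the resized RGB images
    X_flat = X.reshape(len(X), -1)                # one row per image

    X_reduced = PCA(n_components=50).fit_transform(X_flat)   # far fewer features per image
    clf = OneClassSVM(gamma='auto').fit(X_reduced)
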
Category: Data Science

Different approaches of creating the test set

I came across different approaches to creating a test set. Theoretically it's quite simple: just pick some instances at random, typically 20% of the dataset, and set them aside. Below are the approaches. The naive way of creating the test set is:

    def split_train_test(data, test_set_ratio):
        # create indices
        shuffled_indices = np.random.permutation(len(data))
        test_set_size = int(len(data) * test_set_ratio)
        test_set_indices = shuffled_indices[:test_set_size]
        train_set_indices = shuffled_indices[test_set_size:]
        return data.iloc[train_set_indices], data.iloc[test_set_indices]

The above splitting mechanism works, but if the program is run again and again, it will generate a different …
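
One common fix for that run-to-run variability is to fix the random seed, or to use scikit-learn's built-in splitter with a random_state:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.DataFrame({'x': np.arange(100), 'y': np.arange(100) * 2})

    # Fixing random_state puts the same rows in the test set on every run.
    train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
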
Category: Data Science

How to extract and classify data from a column in Excel?

I have a column in an Excel sheet that contains a lot of data separated by || delimiters. The data can be classified into classes such as Entity, IFSC codes, transaction reference id, etc. A single cell looks like this: EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5 Not every cell has the same number of classes, or even the same types of classes. Another example: COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5 I tried extracting this information using regular expressions and …
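
A starting point is to split each cell on the delimiter and then match each fragment against class-specific patterns; a minimal sketch (the IFSC-style regex is an illustrative assumption):

    import re
    import pandas as pd

    df = pd.DataFrame({'raw': ['EFT INCOMING||0141201||NHFI0141201||UTR||TRN REF NO:a1b2c3d4e5']})

    fragments = df['raw'].apply(lambda cell: cell.split('||'))   # list of fragments per cell

    ifsc_like = re.compile(r'^[A-Z]{4}0[A-Z0-9]{6}$')            # assumed IFSC-style pattern
    df['ifsc'] = fragments.apply(
        lambda parts: next((p for p in parts if ifsc_like.match(p)), None))
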
Category: Data Science

What can help decrease outliers' influence on non-tree models?

I have a feature with all values between 0 and 1, except for a few outliers larger than 1. I am trying to collect all the methods that can help decrease outliers' influence on non-tree models: StandardScaler, a rank transform of the feature, an np.log1p(x) transform, MinMaxScaler, winsorization. I can't think of any others ... I guess that's all?
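
For reference, two of those options in code (the clipping limit is illustrative):

    import numpy as np
    from scipy.stats.mstats import winsorize

    x = np.array([0.1, 0.2, 0.3, 0.5, 0.9, 3.7, 8.2])   # mostly 0-1 with a few large outliers

    x_log = np.log1p(x)                      # compresses the large values
    x_wins = winsorize(x, limits=[0, 0.2])   # caps the top 20% at the corresponding percentile
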
Category: Data Science

Predicting a signal based on other signals

I want to predict a signal based on other related signals; how would I go about doing this? My current approach is to do some feature extraction (in the time and frequency domain) on both the ground truth signal and the input signals. I use the features calculated on my input signals to predict the ground truth signal with basic regression models such as RandomForestRegressor or GradientBoostingRegressor. I've used a rolling window approach with varying step/window …
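
A minimal sketch of that rolling-window feature extraction with pandas (column names and window size are illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    df = pd.DataFrame({'input_signal': np.sin(np.linspace(0, 20, 500)),
                       'target_signal': np.cos(np.linspace(0, 20, 500))})

    # Simple time-domain features over a rolling window of the input signal.
    feats = pd.DataFrame({
        'mean': df['input_signal'].rolling(50).mean(),
        'std': df['input_signal'].rolling(50).std(),
    }).dropna()

    model = RandomForestRegressor().fit(feats, df['target_signal'].loc[feats.index])
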
Category: Data Science

LSTM - How to prepare training data from a dataset which contains multiple observations for different events

I'm using an LSTM in a project related to the MobiFall dataset, which contains falls and daily activities - such as walking, sitting etc. - sensed by accelerometer, gyroscope and orientation sensors in the x, y, z axes. So I need to adapt the LSTM to a multivariate form. How can that be done? And after this problem is solved, I have to deal with another: there are multiple time-series events in different files, recorded by different people. For example, I have got ADL_1_walking_1_.txt, …
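
In Keras the multivariate part mostly comes down to the input shape: each sample is (timesteps, features), which here would be 9 features for 3 sensors times 3 axes. A minimal sketch with placeholder data (layer sizes are illustrative):

    import numpy as np
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    timesteps, n_features = 200, 9                   # 3 sensors * 3 axes (x, y, z)
    X = np.random.randn(32, timesteps, n_features)   # placeholder windows
    y = np.random.randint(0, 2, 32)                  # fall vs. daily activity

    model = Sequential([
        LSTM(64, input_shape=(timesteps, n_features)),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(X, y, epochs=2, verbose=0)
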
Category: Data Science

Scaling and handling highly correlated features in tabular data for regression

I am working on a regression problem, trying to predict a target variable with seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and got the correlation coefficients (Pearson r) below for my data. Note that I have included both the numerical predictor variables and the target variable. I am wondering about the following questions: We see that pv3 is …
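
For reference, a correlation matrix like that can be computed with pandas; a minimal sketch with placeholder data (the pv* column names follow the question's naming, the values are random):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(1400, 8),
                      columns=[f'pv{i}' for i in range(1, 8)] + ['target'])

    corr = df.corr(method='pearson')      # full pairwise correlation matrix
    print(corr['target'].sort_values())   # how each predictor relates to the target
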
Category: Data Science

Should I remove the trend from a time series when using DeepAR?

I saw that for some other time-series algorithms it is advised to remove trend and seasonality before doing the prediction (e.g. ARIMA and LSTM). I figured out from the paper that SageMaker's DeepAR deals with seasonality internally, but does the same hold for trend? Let's say I have multiple time series, where some have a positive and some a negative trend. Should I remove the trend and then use DeepAR for prediction, or should I just ignore it and …
Category: Data Science

BertTokenizer on custom data returns same index for all tokens

I'm trying to train a BERT tokenizer on a custom dataset, but when running tokenizer.tokenize on sample data, it returns the same index for every token, which is clearly not what is expected. Running bert_vocab_from_dataset on the sample dataset below returns a vocabulary 88 tokens long. After saving it and reusing it in tensorflow_text.BertTokenizer, I get [88] for all the tokens of the two provided test sentences. Fully reproducible example code: import tensorflow as tf import tensorflow_text from pathlib import …
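
For orientation, a minimal sketch of how a saved vocabulary and tensorflow_text.BertTokenizer are usually wired together; the file name is illustrative and the exact constructor arguments are an assumption based on common usage:

    import tensorflow_text as tf_text

    vocab = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', 'hello', 'world']   # stand-in vocabulary

    # The vocab file is expected to contain exactly one token per line.
    with open('vocab.txt', 'w') as f:
        f.write('\n'.join(vocab))

    tokenizer = tf_text.BertTokenizer('vocab.txt', lower_case=True)
    print(tokenizer.tokenize(['hello world']))
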
Category: Data Science

How to treat Compass data in random forest regression

I'm working on a project where two of the features are entryHeading and exitHeading. Both state the direction (N, NE, E, SE, S, SW, W) of a vehicle at multiple points. My question is how I would go about pre-processing this. My first thought would be to circularize it like I would a 24-hour period, but I'm not sure I should go about it in the same way. The data will eventually be used to train a random forest …
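
The cyclical treatment works the same way as for hours of the day: map each heading to degrees and take sine and cosine, so adjacent headings end up close together; a small sketch:

    import numpy as np
    import pandas as pd

    degrees = {'N': 0, 'NE': 45, 'E': 90, 'SE': 135, 'S': 180, 'SW': 225, 'W': 270, 'NW': 315}

    df = pd.DataFrame({'entryHeading': ['N', 'SE', 'W']})
    rad = np.deg2rad(df['entryHeading'].map(degrees))
    df['entry_sin'], df['entry_cos'] = np.sin(rad), np.cos(rad)
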
Category: Data Science

Normalize data with uneven groups?

I have a dataset with 3 independent variables [city, industry, amount] and wish to normalize the amount, but I wish to do it with respect to industry and city. Simply grouping by city and industry gives me a lot of very sparse groups, on which normalizing (min-max, etc.) wouldn't be very meaningful. Is there any better way to normalize it?
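
For the groups that are large enough, a per-group min-max transform is straightforward with a groupby; a minimal sketch (how to handle the tiny groups, e.g. by falling back to a global scaling, is left open, as in the question):

    import pandas as pd

    df = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                       'industry': ['tech', 'tech', 'retail', 'retail'],
                       'amount': [100.0, 500.0, 20.0, 80.0]})

    def min_max(s):
        return (s - s.min()) / (s.max() - s.min())

    # Min-max normalize the amount within each (city, industry) group.
    df['amount_norm'] = df.groupby(['city', 'industry'])['amount'].transform(min_max)
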
Category: Data Science

Standardization in combination with scaling

Would it be OK to standardize all the features that exhibit a normal distribution (with StandardScaler) and then re-scale all the features to the range 0-1 (with MinMaxScaler)? So far I've only seen people doing one OR the other, but not both in combination. Why is that? Also, is the Shapiro-Wilk test a good way to check whether standardization is advisable? Should all features exhibit a normal distribution, or are you allowed to transform only the ones that do?
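
Mechanically nothing stops you from chaining the two, and scipy provides the Shapiro-Wilk test; a minimal sketch with placeholder data:

    import numpy as np
    from scipy.stats import shapiro
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X = np.random.randn(200, 3)

    # Shapiro-Wilk per feature: a small p-value argues against normality.
    for i in range(X.shape[1]):
        stat, p = shapiro(X[:, i])
        print(f'feature {i}: p={p:.3f}')

    X_scaled = Pipeline([('std', StandardScaler()),
                         ('minmax', MinMaxScaler())]).fit_transform(X)
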
Category: Data Science

Categorical data preprocessing for training an algorithm

I have a training dataset where the values of the "Output" column depend on three columns (which are categorical, with no ordering):

    Inp1     Inp2        Inp3                 Output
    A,B,C    AI,UI,JI    Apple,Bat,Dog        Animals
    L,M,N    LI,DO,LI    Lawn, Moon, Noon     Noun
    X,Y,Z    LI,AI,UI    Xmas,Yemen,Zombie    Extras

So, based on this training data, I need an ML algorithm that predicts the output for any incoming data row such that, if it is similar to the training rows, the output of the most similar row is assigned. The rows can keep increasing (hence get_dummies is creating a lot …
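
If one-hot encoding is kept at all, scikit-learn's OneHotEncoder with handle_unknown='ignore' at least avoids breaking when new category values arrive, unlike get_dummies; a minimal sketch with the rows above:

    from sklearn.preprocessing import OneHotEncoder

    X_train = [['A,B,C', 'AI,UI,JI', 'Apple,Bat,Dog'],
               ['L,M,N', 'LI,DO,LI', 'Lawn, Moon, Noon'],
               ['X,Y,Z', 'LI,AI,UI', 'Xmas,Yemen,Zombie']]
    y_train = ['Animals', 'Noun', 'Extras']

    enc = OneHotEncoder(handle_unknown='ignore')
    X_enc = enc.fit_transform(X_train)   # unseen values at predict time become all-zero columns
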
Category: Data Science

Data preprocessing methods

- Data cleaning
- Data imbalance handling (classification)
- Data smoothing (decreasing noise)
- Creating/deleting features from the original data
- Data transformation (Box-Cox, log transform)
- Making the dataset stationary (time series)
- Other specific data preprocessing methods in NLP and computer vision (very domain-specific ones)

I am trying to research data preparation methods, and so far these are the things I could find. Do you think anything is missing? Thanks.
Category: Data Science
