I have a categorical variable that measures the income of a family:

A: no income
B: up to $500
C: $500-$700
…
P: $5000-$6000
Q: more than $6000

It seems odd to me that I have to create dummies for this variable, since it's ordered. I wonder if it's better to map the values {'A': 0, 'B': 1, …, 'Q': 17} so I can feed them into the algorithm as integers. What's the proper way of preprocessing …
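Since the bands have a natural order, one option is a plain ordinal map, sketched below with pandas; the column name income_band and the example rows are made up for illustration:

```python
import pandas as pd

# Hypothetical DataFrame with the income band column from the question.
df = pd.DataFrame({"income_band": ["A", "C", "B"]})
# Explicit ordinal mapping; extend the dict through 'Q'.
income_order = {"A": 0, "B": 1, "C": 2}
df["income_ord"] = df["income_band"].map(income_order)
print(df)
```

Whether the model should treat these as equally spaced integers is exactly the modeling question being asked, so this covers only the mechanical encoding step.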
My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later. Word embedding techniques are preferably used for longer text sequences, not for single-word strings as in this case, so I don't think those techniques would work correctly here. Additionally, label encoding or label binarization may not be suitable ways to work with names, because of the many different values on the one hand …
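One hedged option for high-cardinality string columns like names is the hashing trick, which maps arbitrary strings into a fixed-width numeric space; the sketch below uses scikit-learn's FeatureHasher on made-up rows:

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up records; each string value is hashed as a "column=value" feature.
rows = [{"first_name": "Ada", "last_name": "Lovelace"},
        {"first_name": "Alan", "last_name": "Turing"}]
hasher = FeatureHasher(n_features=32, input_type="dict")
X = hasher.transform(rows)   # sparse matrix with a fixed number of columns
print(X.shape)               # (2, 32)
```

The width stays fixed no matter how many distinct names appear, which avoids the blow-up that label binarization would cause.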
I am using a dataset (from the literature) to build an MLP and classify real-world samples (from a wet-lab experiment) using this MLP. The performance of the MLP on the literature dataset is good enough. I am following the standard preprocessing procedure where, after splitting, I first standardize my training data with fit_transform and then the testing data with transform, so that I ensure I use only training-data statistics (mean and std) to standardize unseen data. However, when …
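For reference, a minimal sketch of the split-then-scale procedure described above, on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                   # made-up feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_std = scaler.transform(X_test)        # reuse the train statistics
```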
Good evening, I am working on a paper comparing Python libraries for machine learning and deep learning. Trying to evaluate Keras and TensorFlow separately, I'm looking for information about TensorFlow methods or functions that can be used to preprocess datasets, such as those included in scikit-learn (sklearn.preprocessing) or the Keras preprocessing layers, but I can't find anything beyond a one-hot encoding for labels... Does anyone know if what I am looking for exists? Thank you very much!
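As one concrete example of TensorFlow-side preprocessing (assuming the tf.keras preprocessing layers count; in recent TF versions they live under tf.keras.layers), a Normalization layer can learn feature-wise mean and variance much like sklearn's StandardScaler:

```python
import numpy as np
import tensorflow as tf

data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # made-up data
norm = tf.keras.layers.Normalization()
norm.adapt(data)              # learns per-feature mean and variance
print(norm(data).numpy())     # standardized features
```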
I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing, but I am not sure whether I should include the data cleaning, data extraction, and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …
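One way to fold dataset-specific steps into the same Pipeline object is FunctionTransformer; the cleaning rule below (clipping negatives) is purely hypothetical:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def clip_negatives(X):
    # Hypothetical dataset-specific cleaning rule.
    return np.clip(X, 0, None)

pipe = Pipeline([
    ("clean", FunctionTransformer(clip_negatives)),
    ("scale", StandardScaler()),
])
X = np.array([[-1.0, 2.0], [3.0, -4.0]])
print(pipe.fit_transform(X))
```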
At the moment I have this piece of code, which cuts a spectrogram into fixed-length tensors:

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l[0][0]), n):
        if i + n < len(l[0][0]):
            yield l.narrow(2, i, n)
```

The following piece of code:

- downsamples the audio,
- creates mel spectrograms and takes the log of them,
- applies cepstral mean and variance normalization,
- then cuts the spectrogram with the code above into chunks of fixed length and appends them …
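For reference, a made-up usage sketch of the function above with a dummy PyTorch tensor of shape (batch, mel bins, frames):

```python
import torch

X_sample = torch.randn(1, 40, 1000)   # made-up spectrogram tensor
fixed = list(chunks(X_sample, 100))   # 100-frame chunks along dim 2
print(len(fixed), fixed[0].shape)     # 9 torch.Size([1, 40, 100])
```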
I want to train a OneClassSVM() using sklearn, and I have a set of around 800 images in my training set. I am using OpenCV to read the images, resize them to constant dimensions (960x540), and then add them to a NumPy array. The images are RGB and thus have three dimensions. For that, I am reshaping the NumPy array after reading all the images:

```python
# Assume X is my numpy array which contains all the images before reshaping
# Now I reshape …
```
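A sketch of the flattening step on made-up data; note that cv2.resize(img, (960, 540)) returns arrays of shape (540, 960, 3), i.e. (height, width, channels):

```python
import numpy as np

X = np.zeros((800, 540, 960, 3), dtype=np.float32)  # made-up image stack
X_flat = X.reshape(len(X), -1)   # one row per image: 540*960*3 = 1555200
print(X_flat.shape)              # (800, 1555200)
```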
I came across different approaches to creating a test set. Theoretically, it's quite simple: just pick some instances at random, typically 20% of the dataset, and set them aside. Below are the approaches. The naive way of creating the test set is:

```python
def split_train_test(data, test_set_ratio):
    # create indices
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_set_ratio)
    test_set_indices = shuffled_indices[:test_set_size]
    train_set_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_set_indices], data.iloc[test_set_indices]
```

The above splitting mechanism works, but if the program is run again and again, it will generate a different …
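One common way to make the split reproducible across runs is to fix the random seed, e.g. with scikit-learn's train_test_split (illustrative data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({"x": range(10)})   # made-up dataset
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))    # 8 2, identical on every run
```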
I have a column in an Excel sheet that contains a lot of data separated by || delimiters. The data can be classified into classes like entity, IFSC code, transaction reference id, etc. A single cell looks like this:

EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5

Not every cell has the same number of classes or even the same types of classes. Another example:

COMM/CHARGES/FEES||CHECK/REF.6546644473||BILPAY CCTY BEARING C||00.00||00012||18031358||BLPY||TRN REF NO:a1b2c3d4e5

I tried extracting this information using regular expressions and …
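A sketch of one way to start: split each cell on the delimiter, then match known patterns per field; the IFSC regex below assumes the standard 11-character format (4 letters, a zero, 6 alphanumerics):

```python
import re

cell = ("EFT INCOMING||0141201||NHFI0141201||UTR||SBIN118121948660 M S"
        "||some-name ||some-purpose||TRN REF NO:a1b2c3d4e5")
fields = [f.strip() for f in cell.split("||")]
trn_ref = next((f for f in fields if f.startswith("TRN REF NO:")), None)
ifsc = next((f for f in fields if re.fullmatch(r"[A-Z]{4}0[A-Z0-9]{6}", f)), None)
print(trn_ref, ifsc)   # TRN REF NO:a1b2c3d4e5 NHFI0141201
```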
I have a feature with all the values between 0 and 1, except a few outliers larger than 1. I am trying to collect all the methods that can help decrease the outliers' influence on non-tree models:

- StandardScaler
- apply a rank transform to the features
- apply an np.log1p(x) transform to the data
- MinMaxScaler
- winsorization

I wasn't able to think of any others ... I guess that's all?
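A quick sketch of three of these transforms on made-up data with one outlier:

```python
import numpy as np
from scipy.stats import rankdata
from scipy.stats.mstats import winsorize

x = np.array([0.1, 0.2, 0.3, 0.4, 5.0])   # made-up feature with an outlier
print(np.log1p(x))                        # compresses the large value
print(rankdata(x) / len(x))               # rank transform into (0, 1]
print(winsorize(x, limits=[0, 0.2]))      # clip the top 20% down to 0.4
```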
I want to predict a signal based on other related signals; how would I go about doing this? My current approach is to do some feature extraction (in the time and frequency domains) on both the ground-truth signal and the input signals. I use the features calculated on my input signals to predict the ground-truth signal with basic regression models such as RandomForestRegressor or GradientBoostingRegressor. I've used a rolling window approach with varying step/window …
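A sketch of the rolling-window feature extraction plus regression setup described above, with made-up signals and an arbitrary window size:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
signal = pd.Series(rng.standard_normal(1000))   # made-up input signal
target = pd.Series(rng.standard_normal(1000))   # made-up ground truth

window = signal.rolling(50)                     # 50-sample rolling window
features = pd.DataFrame({"mean": window.mean(), "std": window.std()})
mask = features.notna().all(axis=1)             # drop warm-up rows
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(features[mask], target[mask])
```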
I'm using an LSTM in a project related to the MobiFall dataset, which contains falls and daily activities - such as walking, sitting, etc. - sensed by accelerometer, gyroscope, and orientation sensors along the x, y, z axes. So I need to modify the LSTM into a multivariate form. How could it be done? And after this problem is solved, I have to deal with another: there are multiple time-series events in different files, recorded by different people. For example, I have got ADL_1_walking_1_.txt, …
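Multivariate input to a Keras LSTM mostly comes down to the input shape (timesteps, features); the sketch below assumes 9 channels for accelerometer, gyroscope, and orientation in x, y, z, which is my guess at the layout:

```python
import numpy as np
import tensorflow as tf

timesteps, n_features, n_classes = 100, 9, 2
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(timesteps, n_features)),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

X = np.random.rand(16, timesteps, n_features)   # made-up batch of windows
y = np.random.randint(0, n_classes, size=16)    # made-up labels
model.fit(X, y, epochs=1, verbose=0)
```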
I am working on a regression problem, trying to predict a target variable from seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and got the below correlation coefficients (Pearson r) in my data. Note that I have included both the numerical predictor variables and the target variable. I am wondering about the following questions: We see that pv3 is …
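For context, this kind of Pearson matrix typically comes straight from pandas; a sketch with placeholder column names (pv1..pv3, target) and made-up values:

```python
import pandas as pd

df = pd.DataFrame({"pv1": [1, 2, 3, 4], "pv2": [2, 1, 4, 3],
                   "pv3": [1, 3, 2, 4], "target": [2, 4, 6, 8]})
print(df.corr(method="pearson")["target"])   # r of each column vs target
```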
I saw that for some other algorithms for time-series data it is advised to remove trend and seasonality before doing the prediction (e.g., ARIMA and LSTM). I figured out from the paper that SageMaker's DeepAR deals with seasonality internally, but does the same hold for trend? Let's say I have multiple time series, where some have a positive and some a negative trend. Should I remove the trend and then use DeepAR prediction, or should I just ignore it and …
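If detrending does turn out to be necessary, one per-series option is fitting and subtracting a linear trend; this is only an illustration on made-up data, not a claim about what DeepAR requires:

```python
import numpy as np

t = np.arange(100)
series = 0.5 * t + np.random.randn(100)        # made-up series with trend
slope, intercept = np.polyfit(t, series, 1)    # fit the linear trend
detrended = series - (slope * t + intercept)   # residual to forecast
# after forecasting, add slope * t_future + intercept back onto the output
```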
I'm trying to train a BERT tokenizer on a custom dataset, but when running tokenizer.tokenize on sample data, it returns the same index for every token, which is clearly not what is expected. Running bert_vocab_from_dataset on the sample dataset below returns a vocabulary 88 tokens long. After saving it and reusing it in tensorflow_text.BertTokenizer, I get [88] for all the tokens of the two provided test sentences. Fully reproducible example code:

```python
import tensorflow as tf
import tensorflow_text
from pathlib import …
```
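Getting the same id (88, i.e. the vocabulary size) for every token looks consistent with every lookup missing the vocabulary and landing in the OOV bucket. One thing worth ruling out, offered only as a guess, is the vocab file format: BertTokenizer expects one token per line. A sketch with a stand-in vocabulary and a made-up file name:

```python
import tensorflow_text as tf_text

vocab = ["[PAD]", "[UNK]", "hello", "world"]   # stand-in vocabulary
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab))                  # one token per line

tokenizer = tf_text.BertTokenizer("vocab.txt", lower_case=True)
print(tokenizer.tokenize(["hello world"]).to_list())
```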
I'm working on a project where two of the features are entryHeading and exitHeading. Both state the direction (N, NE, E, SE, S, SW, W) of a vehicle at multiple points. My question is how I would go about pre-processing this. My first thought would be to circularize it the way I would a 24-hour period, but I'm not sure I should go about it in the same way. The data will eventually be used to train a Random Forest …
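The circular encoding carries over directly: map each heading to an angle and take its sine and cosine, so that neighboring directions end up close together. The degree assignments below are my assumption:

```python
import numpy as np

heading_deg = {"N": 0, "NE": 45, "E": 90, "SE": 135,
               "S": 180, "SW": 225, "W": 270, "NW": 315}

def encode(heading):
    rad = np.deg2rad(heading_deg[heading])
    return np.sin(rad), np.cos(rad)   # two features per heading column

print(encode("N"), encode("SW"))
```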
I have a dataset with 3 independent variables [city, industry, amount] and wish to normalize the amount, but with respect to industry and city. Simply grouping by city and industry gives me a lot of very sparse groups, on which normalizing (min-max, etc.) wouldn't be very meaningful. Is there a better way to normalize it?
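One pattern worth sketching is per-group normalization with a fallback to global statistics for sparse groups; the 30-row threshold and the toy data are arbitrary assumptions:

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "a", "b"], "industry": ["x", "x", "y"],
                   "amount": [10.0, 20.0, 30.0]})   # made-up rows
g = df.groupby(["city", "industry"])["amount"]
big_enough = g.transform("size") >= 30              # arbitrary threshold
group_z = (df["amount"] - g.transform("mean")) / g.transform("std")
global_z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df["amount_norm"] = group_z.where(big_enough, global_z)
```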
Would it be OK to standardize all the features that exhibit a normal distribution (with StandardScaler) and then re-scale all the features to the range 0-1 (with MinMaxScaler)? So far I've only seen people doing one OR the other, but not both in combination. Why is that? Also, is the Shapiro-Wilk test a good way to test whether standardization is advisable? Should all features exhibit a normal distribution, or are you allowed to transform only the ones that do?
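Mechanically, chaining the two scalers is straightforward, and scipy provides the Shapiro-Wilk test; a sketch on made-up data:

```python
import numpy as np
from scipy.stats import shapiro
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.randn(200, 3)                 # made-up features
stat, p = shapiro(X[:, 0])                  # normality test, one feature
pipe = Pipeline([("std", StandardScaler()), ("minmax", MinMaxScaler())])
X_scaled = pipe.fit_transform(X)            # standardize, then map to [0, 1]
print(p, X_scaled.min(), X_scaled.max())
```

Since both scalers are per-feature affine maps, MinMaxScaler applied after StandardScaler produces the same result as MinMaxScaler alone, which is likely why the combination is rarely seen.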
I have a training dataset where the values of the "Output" column depend on three columns (which are categorical, with no ordering).

| Inp1 | Inp2 | Inp3 | Output |
|------|------|------|--------|
| A,B,C | AI,UI,JI | Apple,Bat,Dog | Animals |
| L,M,N | LI,DO,LI | Lawn,Moon,Noon | Noun |
| X,Y,Z | LI,AI,UI | Xmas,Yemen,Zombie | Extras |

So, based on this training data, I need an ML algorithm to predict any incoming data row such that, if it is similar to the training rows, the most similar row's output is assigned. The rows can go on increasing (hence get_dummies is creating a lot …
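A sketch of similarity-based prediction over one-hot encoded categories; unlike get_dummies, OneHotEncoder(handle_unknown="ignore") copes with values it never saw in training. The data below simplifies each cell to a single value, which is an assumption about the intended layout:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"Inp1": ["A", "L", "X"], "Inp2": ["AI", "LI", "LI"],
                  "Inp3": ["Apple", "Lawn", "Xmas"]})
y = ["Animals", "Noun", "Extras"]
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
query = pd.DataFrame({"Inp1": ["B"], "Inp2": ["AI"], "Inp3": ["Bat"]})
print(model.predict(query))   # nearest training row's output wins
```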
- Data cleaning
- Data imbalance solving (classification)
- Data smoothing (decreasing noise)
- Creating/deleting features from the original data
- Data transformation (Box-Cox, log transform)
- Making the dataset stationary (time series)
- Other specific data preprocessing methods in NLP / computer vision (very specific ones)

I am trying to research data preparation methods, and so far those are the things I could find. Do you think anything is missing? Thanks.