I am exploring Random Forest regressors in sklearn by trying to predict the returns of a stock based on the past hour of data. I have two inputs: the return (% change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes. Here is an example of input data:

      Return      Volume
0   0.000420  119.447233
1  -0.001093   86.455629
2   0.000277  117.940777
3   0.000256   38.084008
4   0.001275   74.376315
...
45  …
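A minimal sketch of the setup as described, with each sample being the last 50 minutes of returns and volumes flattened into one feature vector; all names, shapes, and the synthetic data are illustrative, not from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, window = 500, 50
returns = rng.normal(0, 0.001, size=(n_samples, window))   # per-minute returns
volumes = rng.uniform(20, 150, size=(n_samples, window))    # per-minute volumes

X = np.hstack([returns, volumes])                 # shape (n_samples, 100)
y = rng.normal(0, 0.002, size=n_samples)          # placeholder "next 10 min" target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```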
I am using a dataset (from literature) to build an MLP and classify real-world samples (from a wetlab experiment) using this MLP. The performance of the MLP on the literature dataset is good enough. I am following the standard preprocessing procedure: after splitting, I first standardize my training data with fit_transform and then the test data with transform, so that only the training data statistics (mean and std) are used to standardize unseen data. However, when …
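A minimal sketch of the split-then-scale procedure described above; the placeholder data and variable names (X, y, X_wetlab) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)                 # placeholder "literature" features
y = np.random.randint(0, 2, 200)            # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_std = scaler.transform(X_test)        # reuse training mean/std on unseen data

clf = MLPClassifier(max_iter=500).fit(X_train_std, y_train)
print(clf.score(X_test_std, y_test))

# Real-world wetlab samples would be transformed with the same fitted scaler:
# X_wetlab_std = scaler.transform(X_wetlab)
```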
I am combining several vectors, where each vector is a certain kind of embedding of some object. Since each embedding is very different (some have all components in $[0, 1]$, some have components around 60 or 70, etc.), I want to rescale the vectors before combining them. I thought about using something like min-max rescaling, but I'm not sure how to generalize it to vectors. I could do something of the sort $\frac{v-|v_{min}|}{|v_{max}|-|v_{min}|}$, but I …
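One possible generalization, sketched below: rescale each embedding by its own componentwise min and max so every component lands in $[0, 1]$ before concatenation. The helper name and the example vectors are illustrative, not from the question.

```python
import numpy as np

def minmax_rescale(v, eps=1e-12):
    """Componentwise min-max rescaling of a single vector to [0, 1]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + eps)

emb_a = np.array([0.1, 0.9, 0.5])        # components already in [0, 1]
emb_b = np.array([63.0, 71.5, 58.2])     # components around 60-70

combined = np.concatenate([minmax_rescale(emb_a), minmax_rescale(emb_b)])
print(combined)
```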
Typically the input to a neural network (NN) is transformed to have zero mean and unit standard deviation. I wonder why the std should be 1. What about other scales: 10? 100? Doesn't it make sense to give the NN inputs with a wider range, so that it can separate different clusters more easily and handle the loss function for each cluster in a simpler and more robust way? Has anyone here tried different scales and can share their experience? If the answer depends on the …
Suppose I have an input which has 36 possible values, and I encode it as 36 inputs where exactly one of them is non-zero. What is the optimal value for the non-zero input? It may be: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]. Or: [36, 0, 0, ..., 0], [0, 36, 0, ..., 0], [0, 0, 36, ..., 0]. Or even: [6, 0, 0, ..., 0], [0, 6, 0, ..., 0], [0, 0, 6, ..., 0]. In order for this feature to have the same impact …
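A small sketch of the three encodings being compared, assuming a categorical feature with 36 possible values; the helper function is illustrative and does not answer which magnitude is optimal.

```python
import numpy as np

N_VALUES = 36

def encode(index, scale=1.0):
    """One-hot style vector whose single non-zero entry equals `scale`."""
    v = np.zeros(N_VALUES)
    v[index] = scale
    return v

v_one = encode(3, scale=1.0)      # [0, 0, 0, 1, 0, ..., 0]
v_thirtysix = encode(3, scale=36.0)
v_six = encode(3, scale=6.0)
print(v_one[:5], v_thirtysix[:5], v_six[:5])
```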
I have a simple time-series dataset. It has a date-time feature column:

user, amount, date, job
chris, 9500, 05/19/2022, clean
chris, 14600, 05/12/2021, clean
chris, 67900, 03/27/2021, cooking
chris, 495900, 04/25/2021, fixing

Using Pandas, I split this column into multiple features like year, month, and day.

## Convert date column into datetime type
data["date"] = pd.to_datetime(data["date"], errors="coerce")
## Order by user and date
data = data.sort_values(by=["user", "date"])
## Split date into year, month, day
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
…
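A self-contained version of the same steps (a sketch, assuming the sample rows from the question are built into a DataFrame by hand):

```python
import pandas as pd

# Sample rows from the question
data = pd.DataFrame({
    "user": ["chris", "chris", "chris", "chris"],
    "amount": [9500, 14600, 67900, 495900],
    "date": ["05/19/2022", "05/12/2021", "03/27/2021", "04/25/2021"],
    "job": ["clean", "clean", "cooking", "fixing"],
})

# Convert the date column into datetime type
data["date"] = pd.to_datetime(data["date"], errors="coerce")

# Order by user and date
data = data.sort_values(by=["user", "date"])

# Split the date into year, month, and day features
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day

print(data)
```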
I have a dataset with 3 independent variables [city, industry, amount] and wish to normalize the amount. But I wish to do it with respect to industry and city. Simply grouping by the city and industry gives me a lot of very sparse groups on which normalizing (min-max, etc.) wouldn't be very meaningful. Is there any better way to normalize it?
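A sketch of the straightforward per-group approach for reference (the question is whether something better exists for sparse groups); column names come from the question, the sample rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", "LA"],
    "industry": ["tech", "tech", "tech", "retail", "retail"],
    "amount": [100.0, 250.0, 80.0, 30.0, 90.0],
})

def minmax(s):
    """Min-max scale a Series; leave zeros if the group has no spread."""
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else s * 0.0

# Scale "amount" within each (city, industry) group
df["amount_scaled"] = df.groupby(["city", "industry"])["amount"].transform(minmax)
print(df)
```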
Would it be OK to standardize all the features that exhibit a normal distribution (with StandardScaler) and then re-scale all the features to the range 0-1 (with MinMaxScaler)? So far I've only seen people doing one OR the other, but not in combination. Why is that? Also, is the Shapiro-Wilk test a good way to check whether standardization is advisable? Should all features exhibit a normal distribution, or are you allowed to transform only the ones that do?
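A sketch of the mechanics only (whether the combination is advisable is the question itself): the two scalers chained in a Pipeline, plus a per-feature Shapiro-Wilk check. The synthetic data is a placeholder.

```python
import numpy as np
from scipy import stats
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # placeholder features

scaling = Pipeline([
    ("standardize", StandardScaler()),  # zero mean, unit variance
    ("minmax", MinMaxScaler()),         # then rescale to [0, 1]
])
X_scaled = scaling.fit_transform(X)

# Shapiro-Wilk p-value per feature (a small p-value is evidence against normality)
pvalues = [stats.shapiro(X[:, j])[1] for j in range(X.shape[1])]

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
print(pvalues)
```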
I am doing linear regression on the Boston Housing data set, and applying $\log(y)$ to the target has a huge impact on the MSE: without the transform the MSE is 34.94, while with $y$ log-transformed it is 0.05.
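One point worth checking, sketched below with synthetic data in place of Boston Housing (load_boston has been removed from recent scikit-learn releases): an MSE computed on $\log(y)$ is in squared-log units and is not directly comparable to an MSE computed on $y$; back-transforming the predictions gives a fair comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.exp(X @ np.array([0.4, 0.1, -0.2, 0.3, 0.05]) + rng.normal(scale=0.1, size=400))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (a) fit on the raw target
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)
mse_raw = mean_squared_error(y_te, pred_raw)

# (b) fit on log(y); this MSE lives on the log scale
pred_log = LinearRegression().fit(X_tr, np.log(y_tr)).predict(X_te)
mse_log_scale = mean_squared_error(np.log(y_te), pred_log)

# back-transformed MSE, comparable to mse_raw
mse_back = mean_squared_error(y_te, np.exp(pred_log))
print(mse_raw, mse_log_scale, mse_back)
```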
I am wondering which statistical tools to use when analysing data that have multiple strong batch effects (distributions vary from one batch to another). I would like to correct the batch effect when it originates from one variable, without removing the potential batch effect from other variables. If this is unclear, a short example is probably the best way to explain my problem: imagine that we have 10 people taking part in an experiment. The experiment is …
Should feature scaling/standardization/normalization be done before or after feature selection, and before or after data splitting? I am confused about the order in which the various preprocessing steps should be done.
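One common pattern, sketched below as an illustration rather than the only valid order: split first, then put scaling and feature selection inside a Pipeline so that both are fitted on training data only and reapplied unchanged to the test data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)         # scaler and selector see training data only
print(pipe.score(X_test, y_test))  # the same fitted transforms are reused here
```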
I am trying to use an LSTM model to predict the d+2 and d+3 closing prices. I am not sure whether I should normalize the data with a MinMax scaler (-1, +1) using the log return (P(n)-P(0))/P(0) for each sample. I have tried quite a lot of source code from GitHub, and it doesn't seem to converge on any one technique.
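A sketch of the scaling step only (not the LSTM), with synthetic prices and illustrative window sizes. Note that $(P(n)-P(0))/P(0)$ is the simple return; $\log(P(n)/P(0))$ would be the log return, so both are shown.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
prices = np.cumprod(1 + rng.normal(0, 0.01, size=300))   # synthetic closing prices

window = 50
samples = np.array([prices[i:i + window] for i in range(len(prices) - window)])

simple_ret = (samples - samples[:, [0]]) / samples[:, [0]]   # (P(n) - P(0)) / P(0)
log_ret = np.log(samples / samples[:, [0]])                  # log(P(n) / P(0))

# MinMaxScaler scales each column (time step) independently; fit it on training
# windows only and reuse it on validation/test windows.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(simple_ret)
print(scaled.min(), scaled.max())
```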
I am using MinMaxScaler on a large dataset (2201887, 3) to normalize features. The inverted values do not match the originals. I tested with the target column: first (a), I applied the scaler on 10 values, then did the inverse transformation and was able to get the original values back. Then (b), I inverted 10 normalized values after applying MinMaxScaler on the whole column, and the results were completely different. Result of (a): Result of (b): How can I have the …
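A likely explanation, sketched below with a placeholder column: inverse_transform only recovers the originals when it is applied with the same fitted scaler (same learned min/max) that produced the normalized values; mixing a scaler fitted on 10 values with values scaled on the whole column gives different numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

column = np.random.default_rng(0).uniform(0, 1000, size=(2_000, 1))  # placeholder target column

# (a) fit and invert with the same scaler on the same 10 values -> originals recovered
scaler_a = MinMaxScaler()
ten_scaled = scaler_a.fit_transform(column[:10])
print(np.allclose(scaler_a.inverse_transform(ten_scaled), column[:10]))   # True

# (b) values scaled with a scaler fitted on the whole column must be inverted
# with that same scaler
scaler_b = MinMaxScaler()
all_scaled = scaler_b.fit_transform(column)
print(np.allclose(scaler_b.inverse_transform(all_scaled[:10]), column[:10]))  # True
print(np.allclose(scaler_a.inverse_transform(all_scaled[:10]), column[:10]))  # False
```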
I want to cluster the preparation steps on cooking recipe websites into one cluster so I can distinguish them from the rest of the website. To achieve this I extracted, for each text node of the website, the DOM path (e.g. body->div->div->table->tr ...) and applied one-hot encoding before running the DBSCAN clustering algorithm. My hope was that DBSCAN would recognize not only 100% identical DOM paths as one common cluster, because sometimes a preparation step is e.g. in …
What I often do is check boxplots and histograms for the target/dependent variable and, after much caution, treat or remove the outliers. But I only do this for the target variable: if removal is chosen, I simply drop the entire row where the target value is an outlier. Suppose I have outliers in some independent variables as well. What should I do there? Should I ignore them, or should I take the same approach with …
I want to do K-Fold cross validation and also do normalization or feature scaling within each fold. So let's say we have k folds. At each step we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to fit the feature scaling and data imputation on that training set and then apply the same transformations to the validation set. I want to do this for each step. I am trying …
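A sketch of per-fold preprocessing with placeholder data: wrapping the imputer and scaler in a Pipeline means they are re-fitted on the k-1 training folds for every split and only applied to the held-out fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan       # placeholder missing values
y = rng.integers(0, 2, size=200)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# The pipeline is re-fit on each training split, so the validation fold never
# leaks into the imputation or scaling statistics.
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)
```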
I am working on an anomaly detection problem, using an auto-encoder to denoise the given input. I trained the network with normal (anomaly-free) data, so the model predicts the normal state of a given input. Normalization of the input is essential for my dataset. The problem with normalization arises when the noise value is very high compared to the rest of the dataset; then the prediction follows the noise. For example, if I add noise (delta=300) to 80% of the data and perform normalization on the dataset, whose mean value is 250 and standard deviation …
My current investigations suggest that sklearn.preprocessing.StandardScaler() is not always the right choice for certain types of feature extraction for neural networks. Suppose I want to classify sound events based on spectrogram data. A second of such data could look like this: visible here is a sine wave of around 1 kHz over one second. The settling of the low bands is specific to the feature extraction and not part of the question. The data is an (n, 28, 40) matrix of dBFS values, …
I'm having an issue in Python where it says that the dataframe I have loaded through pandas.read_csv() cannot be scaled using StandardScaler() because of the presence of Inf values or values too big for dtype(float). I checked in R (I am more comfortable with that language, but have to use Python for this project) and it shows that the dataframe does not have any Inf values or NAs. Are there restrictions in Python that convert large numbers or small numbers to …
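A quick way to check from the Python side whether the loaded frame really contains non-finite or overflowing values before scaling; the small DataFrame below stands in for the CSV and contains an infinity on purpose.

```python
import numpy as np
import pandas as pd

# Illustrative frame standing in for the CSV; one value is infinite on purpose.
df = pd.DataFrame({"a": [1.0, 2.0, np.inf], "b": [1e300, 2e-300, 3.0], "c": ["x", "y", "z"]})

numeric = df.select_dtypes(include=[np.number])

print(np.isfinite(numeric.to_numpy()).all())             # False if any inf/-inf/NaN is present
print(numeric.abs().max().sort_values(ascending=False))   # largest magnitude per column
print((numeric.abs() > np.finfo(np.float32).max).any())   # columns too big for float32
```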
The usual strategy in neural networks today is to use min-max scaling to scale the input feature vector from 0 to 1. I want to know if the same principle holds true if our inputs have a large dynamic range (for example, there may be some very large values and some very small values). Isn't it better to use logarithmic scaling in such cases?
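A small sketch comparing plain min-max scaling with a log transform followed by min-max, for a synthetic feature spanning several orders of magnitude.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([0.001, 0.05, 1.0, 30.0, 1_000.0, 250_000.0]).reshape(-1, 1)

minmax_only = MinMaxScaler().fit_transform(x)
log_then_minmax = MinMaxScaler().fit_transform(np.log1p(x))   # log1p keeps 0 valid

# With plain min-max, the small values are squashed near 0 by the single huge value;
# after the log, the spacing between orders of magnitude is preserved.
print(np.round(minmax_only.ravel(), 6))
print(np.round(log_then_minmax.ravel(), 3))
```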