Does it make sense to scale input data for a random forest regressor that takes two different arrays as input?

I am exploring random forest regressors in sklearn by trying to predict the returns of a stock based on the past hour's data. I have two inputs: the return (% change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes. Here is an example of the input data:

          Return      Volume
    0   0.000420  119.447233
    1  -0.001093   86.455629
    2   0.000277  117.940777
    3   0.000256   38.084008
    4   0.001275   74.376315
    ...
    45  …
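A minimal sketch (with synthetic stand-in data, not the asker's) of why scaling usually does not matter here: tree splits depend only on the ordering of feature values, so a random forest fit on standardized inputs should make the same predictions as one fit on raw inputs.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = rng.randn(500, 2)                    # synthetic stand-ins for return, volume
    y = rng.randn(500)

    Xs = StandardScaler().fit_transform(X)   # per-feature affine rescaling
    pred_raw = RandomForestRegressor(random_state=0).fit(X, y).predict(X)
    pred_scaled = RandomForestRegressor(random_state=0).fit(Xs, y).predict(Xs)
    print(np.allclose(pred_raw, pred_scaled))  # True: same splits, same leaves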
Category: Data Science

Is test data required to be transformed by training data statistics?

I am using a dataset (from the literature) to build an MLP and classify real-world samples (from a wet-lab experiment) with it. The performance of the MLP on the literature dataset is good enough. I am following the standard preprocessing procedure: after splitting, I first standardize my training data with fit_transform and then the test data with transform, so that I use only the training data statistics (mean and std) to standardize unseen data. However, when …
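For reference, a minimal sketch of the procedure described, with random stand-in data:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X = np.random.randn(200, 5)
    y = np.random.randint(0, 2, 200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)  # learns mean/std from train only
    X_test_std = scaler.transform(X_test)        # reuses those train statistics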
Category: Data Science

Generalize min-max scaling to vectors

I am combining several vectors, where each vector is a certain kind of embedding of some object. Since the embeddings are very different (some have all components in $[0, 1]$, some have components around 60 or 70, etc.), I want to rescale the vectors before combining them. I thought about using something like min-max rescaling, but I'm not sure how to generalize it to vectors. I could do something of the sort $\frac{v-|v_{min}|}{|v_{max}|-|v_{min}|}$, but I …
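One possible reading of "generalize to vectors" is simply per-component min-max over the collection of embeddings of each type; a hedged sketch (the array shapes are assumptions):

    import numpy as np

    def minmax_scale_rows(V, eps=1e-12):
        """V: (n_objects, dim) array holding one embedding type per row."""
        lo, hi = V.min(axis=0), V.max(axis=0)
        return (V - lo) / (hi - lo + eps)    # eps guards constant components

    emb_a = np.random.rand(100, 16)            # already roughly in [0, 1]
    emb_b = 60 + 10 * np.random.rand(100, 8)   # components around 60-70
    combined = np.hstack([minmax_scale_rows(emb_a), minmax_scale_rows(emb_b)])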
Category: Data Science

Input standardization for Deep Learning - Proper Scaling

Typically the input to a neural network (NN) is transformed to have zero mean and a std of 1. I wonder why the std scale should be 1. What about other scales, such as 10 or 100? Doesn't it make sense to give the NN input with a wider range, so that it can separate different clusters more easily and handle the loss function for each cluster in a simpler, more robust way? Has anyone here tried different scales and can share their experience? If the answer depends on the …
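A quick way to experiment with the idea, as a sketch: standardize, then multiply by a chosen factor so the mean stays 0 while the spread becomes 10 or 100 (the factor values are just illustrations).

    import numpy as np

    def standardize(x, scale=1.0):
        return scale * (x - x.mean(axis=0)) / x.std(axis=0)

    x = np.random.randn(1000, 3) * 50 + 7
    for s in (1, 10, 100):
        z = standardize(x, scale=s)
        print(s, z.mean(axis=0).round(6), z.std(axis=0).round(2))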
Category: Data Science

One-hot encoding with values other than 1

I was wondering: if I have an input with 36 possible values, and I encode it as 36 inputs where exactly one of them is non-zero, what is the optimal value for each non-zero input? It may be: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]. Or: [36, 0, 0, ..., 0], [0, 36, 0, ..., 0], [0, 0, 36, ..., 0]. Or even: [6, 0, 0, ..., 0], [0, 6, 0, ..., 0], [0, 0, 6, ..., 0]. In order for this feature to have the same impact …
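For concreteness, a tiny sketch of such a scaled one-hot encoding; note that a first dense layer can absorb any constant c into its weights, which is why 1 is the conventional choice:

    import numpy as np

    def onehot(index, n=36, c=1.0):
        v = np.zeros(n)
        v[index] = c
        return v

    print(onehot(2, c=1.0)[:5])   # [0. 0. 1. 0. 0.]
    print(onehot(2, c=6.0)[:5])   # [0. 0. 6. 0. 0.]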
Category: Data Science

Do I need to encode numerical variables like "year"?

I have a simple time-series dataset with a date-time feature column:

    user, amount, date, job
    chris, 9500, 05/19/2022, clean
    chris, 14600, 05/12/2021, clean
    chris, 67900, 03/27/2021, cooking
    chris, 495900, 04/25/2021, fixing

Using pandas, I split this column into multiple features like year, month, day:

    ## Convert date column into datetime type
    data["date"] = pd.to_datetime(data["date"], errors="coerce")
    ## Order by user and date
    data = data.sort_values(by=["user", "date"])
    ## Split date into year, month, day
    data["year"] = data["date"].dt.year
    data["month"] = data["date"].dt.month
    data["day"] = data["date"].dt.day
    …
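One common follow-up (an assumption on my part, not part of the question's code) is to keep year as a plain number but encode cyclic parts such as month with sin/cos, so December ends up next to January:

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({"date": pd.to_datetime(["05/19/2022", "05/12/2021"])})
    data["year"] = data["date"].dt.year    # ordinal; can stay numeric
    data["month_sin"] = np.sin(2 * np.pi * data["date"].dt.month / 12)
    data["month_cos"] = np.cos(2 * np.pi * data["date"].dt.month / 12)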
Category: Data Science

Normalize data with uneven groups?

I have a dataset with 3 independent variables [city, industry, amount] and wish to normalize the amount. But I wish to do it with respect to industry and city. Simply grouping by the city and industry gives me a lot of very sparse groups on which normalizing (min-max, etc.) wouldn't be very meaningful. Is there any better way to normalize it?
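One hedged option: z-score within each (city, industry) group, but fall back to coarser industry-level statistics when a group is too sparse. Column names follow the question; the size threshold is an illustrative assumption.

    import pandas as pd

    def group_zscore(df, cols, target="amount", min_size=30):
        g = df.groupby(cols)[target]
        big_enough = g.transform("size") >= min_size
        z_fine = (df[target] - g.transform("mean")) / g.transform("std")
        g2 = df.groupby(cols[-1])[target]      # fallback: industry only
        z_coarse = (df[target] - g2.transform("mean")) / g2.transform("std")
        return z_fine.where(big_enough, z_coarse)

    # df["amount_norm"] = group_zscore(df, ["city", "industry"])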
Category: Data Science

Standardization in combination with scaling

Would it be OK to standardize all the features that exhibit a normal distribution (with StandardScaler) and then rescale all the features to the range 0-1 (with MinMaxScaler)? So far I've only seen people doing one or the other, but not the combination. Why is that? Also, is the Shapiro-Wilk test a good way to test whether standardization is advisable? Should all features exhibit a normal distribution, or are you allowed to transform only the ones that do?
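Mechanically, chaining both is easy; a sketch with stand-in data. Note that for a fixed dataset the end result equals MinMaxScaler alone, since both transforms are affine per feature, which is one reason the combination is rarely seen:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X = np.random.randn(100, 4) * 3 + 10
    both = make_pipeline(StandardScaler(), MinMaxScaler())
    X_scaled = both.fit_transform(X)
    print(X_scaled.min(), X_scaled.max())   # 0.0 1.0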
Category: Data Science

Correcting for one of multiple strong batch effects in a dataset

I am wondering which statistical tools to use when analysing data that have multiple strong batch effects (distributions vary from one batch to another). I would like to correct the batch effect originating from one variable without removing the potential batch effects of the other variables. If this is unclear, a short example is probably the best way to explain my problem: imagine that we have 10 persons taking part in an experiment. The experiment is …
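A hypothetical sketch of the simplest version of this: center the measurement within each level of the one batch variable you want to remove, leaving variation tied to the other variables untouched. Column names here ("measure", "day") are assumptions, not from the question.

    import pandas as pd

    def remove_one_batch_effect(df, value="measure", batch="day"):
        g = df.groupby(batch)[value]
        # subtract per-batch means, then restore the global mean
        return df[value] - g.transform("mean") + df[value].mean()

    # df["measure_corrected"] = remove_one_batch_effect(df)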
Category: Data Science

Is it better to use a MinMax or a Log Return normalization to predict stock price movements?

I am trying to use an LSTM model to predict d+2 and d+3 closing prices. I am not sure whether I should normalize the data with a MinMax scaler (-1, +1) or use the log return (P(n)-P(0))/P(0) for each sample. I have tried quite a lot of source code from GitHub, and it doesn't seem to converge on any one technique.
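For comparison, a sketch of both preprocessing routes on a synthetic price series (a stand-in, not real market data):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    prices = 100 + np.cumsum(np.random.randn(300))      # synthetic closes
    returns = (prices[1:] - prices[:-1]) / prices[:-1]  # per-step returns route
    scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(
        prices.reshape(-1, 1))                          # MinMax route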
Category: Data Science

sklearn MinMaxScaler: Inverse does not equal original

I am using MinMaxScaler on a large dataset (2201887, 3) to normalize features, but the inverted values do not match the originals. I tested with the target column: first (a), I applied the scaler to 10 values, then did the inverse transformation and was able to recover the original values. Then (b), I inverted 10 normalized values after applying MinMaxScaler to the whole column, and the results were completely different (output screenshots of (a) and (b) omitted). How can I have the …
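The round trip is exact only when a single scaler, fit once, does both directions; a minimal sketch with stand-in data:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    col = np.random.rand(1000, 1) * 500
    scaler = MinMaxScaler().fit(col)      # fit on the whole column, once
    back = scaler.inverse_transform(scaler.transform(col[:10]))
    print(np.allclose(back, col[:10]))    # True with one consistent scaler

If case (a) and case (b) use scalers fit on different subsets (10 values vs. the whole column), their min/max differ and the inverses will not match.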
Category: Data Science

Shall I use ordinal encoding or One-Hot-Encoding when using DBSCAN for content clustering on websites?

I want to cluster the preparation steps on cooking-recipe websites into one cluster so I can distinguish them from the rest of the website. To achieve this, I extracted the DOM path of each text node of the website (e.g. body->div->div->table->tr ....) and applied one-hot encoding before running the DBSCAN clustering algorithm. My hope was that DBSCAN would recognize not only 100% identical DOM paths as one common cluster, because sometimes one preparation step is e.g. in …
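A hypothetical sketch of a middle ground: treat each DOM path as a bag of tags, so paths that share most tags stay close even when one tag differs. CountVectorizer and the eps/min_samples values are illustrative assumptions, not from the question.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import DBSCAN

    paths = ["body div div table tr", "body div div table tr td",
             "body div span", "body div div table tr"]
    X = CountVectorizer(binary=True).fit_transform(paths)
    labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X.toarray())
    print(labels)   # near-identical paths share a label; -1 marks noise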
Category: Data Science

Should outliers be removed only from the target variable or from any variable where they are found?

What I often do is check boxplots and histograms for the target/dependent variable and, after much caution, treat or remove the outliers. But I do this only for the target variable; i.e., if removal is warranted, I simply drop the entire row where the target value is an outlier. Suppose I have outliers in some independent variables as well. What should I do there? Should I ignore them? Or should I take the same approach with …
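For reference, a sketch of the usual boxplot (IQR) rule applied to any column, target or not; what to do with the flagged rows remains the judgment call the question is about:

    import numpy as np
    import pandas as pd

    def iqr_mask(s, k=1.5):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return (s < q1 - k * iqr) | (s > q3 + k * iqr)

    df = pd.DataFrame({"x": np.r_[np.random.randn(100), 15.0]})
    print(df[iqr_mask(df["x"])])   # rows flagged as outliers in x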
Category: Data Science

K-Fold cross validation and data leakage

I want to do K-fold cross-validation and also normalization or feature scaling within each fold. So let's say we have k folds. At each step, we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to do feature scaling and data imputation on that training set and then apply the same transformation to the validation set. I want to do this at each step. I am trying …
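In sklearn this pattern is exactly what a Pipeline inside cross-validation gives you: the preprocessing is refit on each training fold, so no validation-fold statistics leak in. A sketch with stand-in data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, KFold

    X, y = np.random.randn(200, 5), np.random.randint(0, 2, 200)
    pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
    scores = cross_val_score(pipe, X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))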
Category: Data Science

What is the correct way to perform normalization on data in an autoencoder?

I am working on an anomaly detection problem, using an autoencoder to denoise the given input. I trained the network on normal (anomaly-free) data, so the model predicts the normal state of a given input. Normalization of the input is essential for my dataset. The problem with normalization is that when the noise value is very high compared to the rest of the dataset, the prediction follows the noise. For example, if I add noise (delta=300) to 80% of the data and perform normalization on the dataset, whose mean value is 250 and standard deviation …
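One hedged option for noise-dominated inputs: fit the normalization statistics on the clean training data only and reuse them at inference, possibly with RobustScaler (median/IQR), which extreme values distort far less than mean/std. The delta and mean below mirror the question's numbers, but the data are stand-ins.

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    clean = np.random.randn(1000, 1) * 10 + 250   # anomaly-free training data
    noisy = clean.copy()
    noisy[:800] += 300                            # delta=300 on 80% of rows

    scaler = RobustScaler().fit(clean)            # stats frozen from clean data
    scaled_noisy = scaler.transform(noisy)        # noise cannot shift the stats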
Category: Data Science

Is it wise to always `StandardScaler()` features? [SOLVED]

My current investigations point to sklearn.preprocessing.StandardScaler() not always being the right choice for certain types of feature extraction for neural networks. Suppose I want to classify sound events based on spectrogram data. A second of such data could look like this (figure omitted): visible is a sine wave of around 1 kHz over one second. The settling of the low bands is specific to the feature extraction and not part of the question. The data is an (n, 28, 40) matrix of dBFS values, …
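A hypothetical alternative for this kind of data: scale the whole (n, 28, 40) dBFS tensor with one global mean/std, which preserves relative levels between bands, instead of per-feature standardization, which erases them. Shapes and values below are stand-ins.

    import numpy as np

    X = np.random.randn(32, 28, 40) * 20 - 60     # stand-in dBFS frames
    X_global = (X - X.mean()) / X.std()           # one statistic for all bins

    # Per-feature alternative (what StandardScaler does after flattening):
    flat = X.reshape(len(X), -1)
    X_perfeat = (flat - flat.mean(0)) / flat.std(0)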
Category: Data Science

Python is reading my data with NaNs and Infs, but it doesn't contain any

I'm having an issue in Python where it says that the dataframe I loaded through pandas.read_csv() cannot be scaled using StandardScaler() because of the presence of Inf values or values too big for dtype(float). I checked in R (I am more comfortable with that language, but have to use Python for this project) and it shows that the dataframe does not have any Inf values or NAs. Are there restrictions in Python that convert large or small numbers to …
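A sketch for hunting down the offending cells (the file name is a placeholder): coerce every column to numeric, then ask where the values are NaN or non-finite; cells that R displayed fine but pandas parsed as strings show up as NaN after coercion.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("data.csv")                  # placeholder path
    num = df.apply(pd.to_numeric, errors="coerce")
    bad = ~np.isfinite(num)
    print(bad.sum())                              # non-finite count per column
    print(df[bad.any(axis=1)].head())             # the problematic rows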
Category: Data Science

Data scaling for large dynamic range in neural networks

The usual strategy in neural networks today is to min-max scale the input feature vector from 0 to 1. I want to know whether the same principle holds when the inputs have a large dynamic range (for example, some very large values alongside some very small ones). Isn't it better to use logarithmic scaling in such cases?
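A small sketch of the contrast: with a large dynamic range, plain min-max collapses the small values near 0, while log1p (which also handles zeros) keeps them distinguishable before scaling:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    x = np.array([[0.001], [0.5], [10.0], [1e6]])
    linear = MinMaxScaler().fit_transform(x)            # small values collapse
    logged = MinMaxScaler().fit_transform(np.log1p(x))  # spread preserved
    print(np.hstack([linear, logged]))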
Category: Data Science
