I am exploring Random Forest regressors in sklearn by trying to predict the returns of a stock based on the past hour of data. I have two inputs: the return (% change) and the volume of the stock for the last 50 minutes. My output is the predicted price for the next 10 minutes. Here is an example of input data:

      Return      Volume
0   0.000420  119.447233
1  -0.001093   86.455629
2   0.000277  117.940777
3   0.000256   38.084008
4   0.001275   74.376315
...
45  …
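A minimal sketch of the setup as described, with each sample being the last 50 minutes of returns and volumes flattened into one feature vector; all names, shapes, and the synthetic data are illustrative, not from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, window = 500, 50
returns = rng.normal(0, 0.001, size=(n_samples, window))   # per-minute returns
volumes = rng.uniform(20, 150, size=(n_samples, window))    # per-minute volumes

X = np.hstack([returns, volumes])                 # shape (n_samples, 100)
y = rng.normal(0, 0.002, size=n_samples)          # placeholder "next 10 min" target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```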
I am using a dataset (from literature) to build an MLP and classify real-world samples (from a wetlab experiment) using this MLP. The performance of the MLP on the literature dataset is good enough. I am following the standard preprocessing procedure: after splitting, I first standardize my training data with fit_transform and then the test data with transform, so that only the training data statistics (mean and std) are used to standardize unseen data. However, when …
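A minimal sketch of the split-then-scale procedure described above; the placeholder data and variable names (X, y, X_wetlab) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)                 # placeholder "literature" features
y = np.random.randint(0, 2, 200)            # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_std = scaler.transform(X_test)        # reuse training mean/std on unseen data

clf = MLPClassifier(max_iter=500).fit(X_train_std, y_train)
print(clf.score(X_test_std, y_test))

# Real-world wetlab samples would be transformed with the same fitted scaler:
# X_wetlab_std = scaler.transform(X_wetlab)
```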
I am combining several vectors, where each vector is a certain kind of embedding of some object. Since each embedding is very different (some have all components in $[0, 1]$, some have components around 60 or 70, etc.), I want to rescale the vectors before combining them. I thought about using something like min-max rescaling, but I'm not sure how to generalize it to vectors. I could do something of the sort $\frac{v-|v_{min}|}{|v_{max}|-|v_{min}|}$, but I …
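One possible generalization, sketched below: rescale each embedding by its own componentwise min and max so every component lands in $[0, 1]$ before concatenation. The helper name and the example vectors are illustrative, not from the question.

```python
import numpy as np

def minmax_rescale(v, eps=1e-12):
    """Componentwise min-max rescaling of a single vector to [0, 1]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + eps)

emb_a = np.array([0.1, 0.9, 0.5])        # components already in [0, 1]
emb_b = np.array([63.0, 71.5, 58.2])     # components around 60-70

combined = np.concatenate([minmax_rescale(emb_a), minmax_rescale(emb_b)])
print(combined)
```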
Typically the input to a neural network (NN) is transformed to have zero mean and unit standard deviation. I wonder why the std should be 1. What about other scales: 10? 100? Doesn't it make sense to give the NN inputs with a wider range, so that it can separate different clusters more easily and handle the loss function for each cluster in a simpler and more robust way? Has anyone here tried different scales and can share their experience? If the answer depends on the …
Suppose I have an input which has 36 possible values, and I encode it as 36 inputs where exactly one of them is non-zero. What is the optimal value for the non-zero input? It may be: [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]. Or: [36, 0, 0, ..., 0], [0, 36, 0, ..., 0], [0, 0, 36, ..., 0]. Or even: [6, 0, 0, ..., 0], [0, 6, 0, ..., 0], [0, 0, 6, ..., 0]. In order for this feature to have the same impact …
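A small sketch of the three encodings being compared, assuming a categorical feature with 36 possible values; the helper function is illustrative and does not answer which magnitude is optimal.

```python
import numpy as np

N_VALUES = 36

def encode(index, scale=1.0):
    """One-hot style vector whose single non-zero entry equals `scale`."""
    v = np.zeros(N_VALUES)
    v[index] = scale
    return v

v_one = encode(3, scale=1.0)      # [0, 0, 0, 1, 0, ..., 0]
v_thirtysix = encode(3, scale=36.0)
v_six = encode(3, scale=6.0)
print(v_one[:5], v_thirtysix[:5], v_six[:5])
```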
I have a simple time-series dataset. It has a date-time feature column:

user, amount, date, job
chris, 9500, 05/19/2022, clean
chris, 14600, 05/12/2021, clean
chris, 67900, 03/27/2021, cooking
chris, 495900, 04/25/2021, fixing

Using Pandas, I split this column into multiple features like year, month, and day.

## Convert date column into datetime type
data["date"] = pd.to_datetime(data["date"], errors="coerce")
## Order by user and date
data = data.sort_values(by=["user", "date"])
## Split date into year, month, day
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
…
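A self-contained version of the same steps (a sketch, assuming the sample rows from the question are built into a DataFrame by hand):

```python
import pandas as pd

# Sample rows from the question
data = pd.DataFrame({
    "user": ["chris", "chris", "chris", "chris"],
    "amount": [9500, 14600, 67900, 495900],
    "date": ["05/19/2022", "05/12/2021", "03/27/2021", "04/25/2021"],
    "job": ["clean", "clean", "cooking", "fixing"],
})

# Convert the date column into datetime type
data["date"] = pd.to_datetime(data["date"], errors="coerce")

# Order by user and date
data = data.sort_values(by=["user", "date"])

# Split the date into year, month, and day features
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day

print(data)
```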
I have a dataset with 3 independent variables [city, industry, amount] and wish to normalize the amount. But I wish to do it with respect to industry and city. Simply grouping by the city and industry gives me a lot of very sparse groups on which normalizing (min-max, etc.) wouldn't be very meaningful. Is there any better way to normalize it?
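A sketch of the straightforward per-group approach for reference (the question is whether something better exists for sparse groups); column names come from the question, the sample rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA", "LA"],
    "industry": ["tech", "tech", "tech", "retail", "retail"],
    "amount": [100.0, 250.0, 80.0, 30.0, 90.0],
})

def minmax(s):
    """Min-max scale a Series; leave zeros if the group has no spread."""
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else s * 0.0

# Scale "amount" within each (city, industry) group
df["amount_scaled"] = df.groupby(["city", "industry"])["amount"].transform(minmax)
print(df)
```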
Would it be OK to standardize all the features that exhibit a normal distribution (with StandardScaler) and then re-scale all the features to the range 0-1 (with MinMaxScaler)? So far I've only seen people doing one OR the other, but not in combination. Why is that? Also, is the Shapiro-Wilk test a good way to check whether standardization is advisable? Should all features exhibit a normal distribution, or are you allowed to transform only the ones that do?
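A sketch of the mechanics only (whether the combination is advisable is the question itself): the two scalers chained in a Pipeline, plus a per-feature Shapiro-Wilk check. The synthetic data is a placeholder.

```python
import numpy as np
from scipy import stats
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # placeholder features

scaling = Pipeline([
    ("standardize", StandardScaler()),  # zero mean, unit variance
    ("minmax", MinMaxScaler()),         # then rescale to [0, 1]
])
X_scaled = scaling.fit_transform(X)

# Shapiro-Wilk p-value per feature (a small p-value is evidence against normality)
pvalues = [stats.shapiro(X[:, j])[1] for j in range(X.shape[1])]

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
print(pvalues)
```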
I am doing linear regression on the Boston Housing data set, and applying $\log(y)$ to the target has a huge impact on the MSE: without the transform the MSE is 34.94, while with $y$ log-transformed it is 0.05.
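One point worth checking, sketched below with synthetic data in place of Boston Housing (load_boston has been removed from recent scikit-learn releases): an MSE computed on $\log(y)$ is in squared-log units and is not directly comparable to an MSE computed on $y$; back-transforming the predictions gives a fair comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.exp(X @ np.array([0.4, 0.1, -0.2, 0.3, 0.05]) + rng.normal(scale=0.1, size=400))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (a) fit on the raw target
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)
mse_raw = mean_squared_error(y_te, pred_raw)

# (b) fit on log(y); this MSE lives on the log scale
pred_log = LinearRegression().fit(X_tr, np.log(y_tr)).predict(X_te)
mse_log_scale = mean_squared_error(np.log(y_te), pred_log)

# back-transformed MSE, comparable to mse_raw
mse_back = mean_squared_error(y_te, np.exp(pred_log))
print(mse_raw, mse_log_scale, mse_back)
```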
I am wondering which statistical tools to use when analysing data that have multiple strong batch effects (distributions vary from one batch to another). I would like to correct the batch effect when it originates from one variable, without removing the potential batch effect from other variables. If this is unclear, a short example is probably the best way to explain my problem: imagine that we have 10 people taking part in an experiment. The experiment is …
Should feature scaling/standardization/normalization be done before or after feature selection, and before or after data splitting? I am confused about the order in which the various preprocessing steps should be done.
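One common pattern, sketched below as an illustration rather than the only valid order: split first, then put scaling and feature selection inside a Pipeline so that both are fitted on training data only and reapplied unchanged to the test data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)         # scaler and selector see training data only
print(pipe.score(X_test, y_test))  # the same fitted transforms are reused here
```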
I am trying to use an LSTM model to predict the d+2 and d+3 closing prices. I am not sure whether I should normalize the data with a MinMax scaler (-1, +1) using the log return (P(n)-P(0))/P(0) for each sample. I have tried quite a lot of source code from GitHub, and it doesn't seem to converge on any one technique.
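A sketch of the scaling step only (not the LSTM), with synthetic prices and illustrative window sizes. Note that $(P(n)-P(0))/P(0)$ is the simple return; $\log(P(n)/P(0))$ would be the log return, so both are shown.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
prices = np.cumprod(1 + rng.normal(0, 0.01, size=300))   # synthetic closing prices

window = 50
samples = np.array([prices[i:i + window] for i in range(len(prices) - window)])

simple_ret = (samples - samples[:, [0]]) / samples[:, [0]]   # (P(n) - P(0)) / P(0)
log_ret = np.log(samples / samples[:, [0]])                  # log(P(n) / P(0))

# MinMaxScaler scales each column (time step) independently; fit it on training
# windows only and reuse it on validation/test windows.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(simple_ret)
print(scaled.min(), scaled.max())
```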
I am using MinMaxScaler on a large dataset (2201887, 3) to normalize features. The inverted values do not match the originals. I tested with the target column: first (a), I applied the scaler on 10 values, then did the inverse transformation and was able to get the original values back. Then (b), I inverted 10 normalized values after applying MinMaxScaler on the whole column, and the results were completely different. Result of (a): Result of (b): How can I have the …
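A likely explanation, sketched below with a placeholder column: inverse_transform only recovers the originals when it is applied with the same fitted scaler (same learned min/max) that produced the normalized values; mixing a scaler fitted on 10 values with values scaled on the whole column gives different numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

column = np.random.default_rng(0).uniform(0, 1000, size=(2_000, 1))  # placeholder target column

# (a) fit and invert with the same scaler on the same 10 values -> originals recovered
scaler_a = MinMaxScaler()
ten_scaled = scaler_a.fit_transform(column[:10])
print(np.allclose(scaler_a.inverse_transform(ten_scaled), column[:10]))   # True

# (b) values scaled with a scaler fitted on the whole column must be inverted
# with that same scaler
scaler_b = MinMaxScaler()
all_scaled = scaler_b.fit_transform(column)
print(np.allclose(scaler_b.inverse_transform(all_scaled[:10]), column[:10]))  # True
print(np.allclose(scaler_a.inverse_transform(all_scaled[:10]), column[:10]))  # False
```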
I want to cluster the preparation steps on cooking recipe websites into one cluster so I can distinguish them from the rest of the website. To achieve this I extracted, for each text node of the website, the DOM path (e.g. body->div->div->table->tr ...) and applied one-hot encoding before running the DBSCAN clustering algorithm. My hope was that DBSCAN would recognize not only 100% identical DOM paths as one common cluster, because sometimes a preparation step is e.g. in …
What I often do is check boxplots and histograms for the target/dependent variable and, after much caution, treat or remove the outliers. But I only do this for the target variable: if removal is chosen, I simply drop the entire row where the target value is an outlier. Suppose I have outliers in some independent variables as well. What should I do there? Should I ignore them, or should I take the same approach with …
I want to do K-Fold cross validation and also do normalization or feature scaling within each fold. So let's say we have k folds. At each step we take one fold as the validation set and the remaining k-1 folds as the training set. Now I want to fit the feature scaling and data imputation on that training set and then apply the same transformations to the validation set. I want to do this for each step. I am trying …
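A sketch of per-fold preprocessing with placeholder data: wrapping the imputer and scaler in a Pipeline means they are re-fitted on the k-1 training folds for every split and only applied to the held-out fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan       # placeholder missing values
y = rng.integers(0, 2, size=200)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# The pipeline is re-fit on each training split, so the validation fold never
# leaks into the imputation or scaling statistics.
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)
```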
I am working on an anomaly detection problem, using an auto-encoder to denoise the given input. I trained the network with normal (anomaly-free) data, so the model predicts the normal state of a given input. Normalization of the input is essential for my dataset. The problem with normalization arises when the noise value is very high compared to the rest of the dataset; then the prediction follows the noise. For example, if I add noise (delta=300) to 80% of the data and perform normalization on the dataset, whose mean value is 250 and standard deviation …
My current investigations suggest that sklearn.preprocessing.StandardScaler() is not always the right choice for certain types of feature extraction for neural networks. Suppose I want to classify sound events based on spectrogram data. A second of such data could look like this: visible here is a sine wave of around 1 kHz over one second. The settling of the low bands is specific to the feature extraction and not part of the question. The data is an (n, 28, 40) matrix of dBFS values, …
I'm having an issue in Python where it says that the dataframe I have loaded through pandas.read_csv() cannot be scaled using StandardScaler() because of the presence of Inf values or values too big for dtype(float). I checked in R (I am more comfortable with that language, but have to use Python for this project) and it shows that the dataframe does not have any Inf values or NAs. Are there restrictions in Python that convert large numbers or small numbers to …
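A quick way to check from the Python side whether the loaded frame really contains non-finite or overflowing values before scaling; the small DataFrame below stands in for the CSV and contains an infinity on purpose.

```python
import numpy as np
import pandas as pd

# Illustrative frame standing in for the CSV; one value is infinite on purpose.
df = pd.DataFrame({"a": [1.0, 2.0, np.inf], "b": [1e300, 2e-300, 3.0], "c": ["x", "y", "z"]})

numeric = df.select_dtypes(include=[np.number])

print(np.isfinite(numeric.to_numpy()).all())             # False if any inf/-inf/NaN is present
print(numeric.abs().max().sort_values(ascending=False))   # largest magnitude per column
print((numeric.abs() > np.finfo(np.float32).max).any())   # columns too big for float32
```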
The usual strategy in neural networks today is to use min-max scaling to scale the input feature vector from 0 to 1. I want to know if the same principle holds true if our inputs have a large dynamic range (for example, there may be some very large values and some very small values). Isn't it better to use logarithmic scaling in such cases?
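A small sketch comparing plain min-max scaling with a log transform followed by min-max, for a synthetic feature spanning several orders of magnitude.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([0.001, 0.05, 1.0, 30.0, 1_000.0, 250_000.0]).reshape(-1, 1)

minmax_only = MinMaxScaler().fit_transform(x)
log_then_minmax = MinMaxScaler().fit_transform(np.log1p(x))   # log1p keeps 0 valid

# With plain min-max, the small values are squashed near 0 by the single huge value;
# after the log, the spacing between orders of magnitude is preserved.
print(np.round(minmax_only.ravel(), 6))
print(np.round(log_then_minmax.ravel(), 3))
```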