Why is my regression model always dominated by one feature?

I am working on a financial prediction problem, i.e. a time-series prediction problem. I have three features with high pairwise correlation (each pair's correlation is about 0.6), and I fit a linear regression. I assumed the coefficients would be similar across the three features, but I get a coefficient vector like [0.01, 0.15, 0.01], which means the second feature has the biggest coefficient (the features are normalized) and dominates the prediction result. I don't …
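A minimal sketch of the situation on synthetic data (not the asker's data): the 0.77/0.63 mixing weights are chosen so each pair of features has correlation near 0.6, and a ridge penalty is shown as one common remedy, since the L2 term shrinks correlated coefficients toward each other instead of letting one absorb the shared signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# three features built from a shared component so pairwise correlation is ~0.6
base = rng.normal(size=n)
X = np.column_stack([0.77 * base + 0.63 * rng.normal(size=n) for _ in range(3)])
# the true signal weights all three features equally
y = X.sum(axis=1) + 0.1 * rng.normal(size=n)

# ordinary least squares
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# ridge regression: X'X + lam*I is better conditioned than X'X alone,
# which stabilizes coefficients when features are correlated
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

With correlated features the data cannot distinguish which feature "owns" the shared variance, so tiny noise differences can push most of the weight onto one of them; the ridge solution always has a smaller coefficient norm and spreads weight more evenly.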
Category: Data Science

PCA and Orange software

I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of the variables and calculated the percentages. Then I used the Orange software to run PCA. I uploaded the file and selected the columns and rows, and when it comes to PCA the program asks me whether I want to normalize …
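A sketch of what that normalization choice does, using plain NumPy in place of Orange and a made-up 15x6 matrix (the real book counts are not in the question): without per-column scaling, whichever variables happen to have the largest numeric range dominate the principal components.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy stand-in for the 15-books x 6-variables table (values are made up);
# the last three columns are on a much larger scale than the first three
X = rng.random((15, 6)) * np.array([1.0, 1.0, 1.0, 100.0, 100.0, 100.0])

def pca(M, k=2):
    Mc = M - M.mean(axis=0)                    # PCA always centers first
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:k].T, S**2 / np.sum(S**2)  # scores, explained-variance ratios

scores_raw, var_raw = pca(X)                   # without normalization
Xn = (X - X.mean(axis=0)) / X.std(axis=0)      # divide each column by its std
scores_norm, var_norm = pca(Xn)
```

If the 6 variables are already comparable percentages, normalizing is optional; if their spreads differ a lot, answering "yes" to Orange's prompt makes each variable contribute equally.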
Category: Data Science

What parameters to use when normalising training, validation, and testing data?

I know a similar post was made here, but I wanted to ask some follow-up questions. I am running a cross-validation search to find values for a set of hyper-parameters and need to normalise the data. Suppose we split the data as follows: split the full data into 'training' (call this set 'A' for now) and testing data, then split the 'training' set into training (call this set 'B' for now) and validation sets. What parameters should be used when normalising these datasets? Am I …
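The usual convention, sketched below with a hand-rolled min-max scaler (equivalent in spirit to sklearn's `MinMaxScaler`, but kept in plain NumPy): during the search, fit the scaler on B only and apply it to both B and the validation fold; once hyper-parameters are fixed, refit on all of A and apply that to the test set.

```python
import numpy as np

def fit_minmax(train):
    # normalization parameters must come from the fitted portion only
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(X, lo, hi):
    return (X - lo) / (hi - lo)

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 3))        # the initial "training" split (set A)
test = rng.normal(size=(20, 3))
B, val = A[:80], A[80:]              # inner split for the hyper-parameter search

# during cross-validation: fit on B, transform both B and the validation fold
lo, hi = fit_minmax(B)
B_s, val_s = apply_minmax(B, lo, hi), apply_minmax(val, lo, hi)
# val_s may stray slightly outside [0, 1]; that is expected and harmless

# after hyper-parameters are chosen: refit on all of A, transform the test set
lo, hi = fit_minmax(A)
A_s, test_s = apply_minmax(A, lo, hi), apply_minmax(test, lo, hi)
```

The rule of thumb: never let statistics from data a model is evaluated on leak into the transform it is evaluated with.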
Category: Data Science

Generalize min-max scaling to vectors

I am combining several vectors, where each vector is a certain kind of embedding of some object. Since the embeddings are very different (some have all components in $[0, 1]$, some have components around 60 or 70, etc.), I want to rescale the vectors before combining them. I thought about using something like min-max rescaling, but I'm not sure how to generalize it to vectors. I could do something of the sort $\frac{v-|v_{min}|}{|v_{max}|-|v_{min}|}$, but I …
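Two natural generalizations, sketched as an assumption about what "min-max for vectors" could mean (neither is from the question itself): apply min-max component-wise across the stack of vectors, or divide every vector by the largest norm in the set, which preserves directions while component-wise scaling does not.

```python
import numpy as np

def minmax_vectors(V):
    """Component-wise min-max over a stack of vectors (one vector per row)."""
    lo, hi = V.min(axis=0), V.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant components
    return (V - lo) / span

def norm_rescale(V):
    """Alternative: shrink the whole set by its largest vector norm, so every
    vector lands in the unit ball and directions are preserved."""
    return V / np.linalg.norm(V, axis=1).max()

# toy embeddings: one component lives in [0, 1], the other around 60-75
V = np.array([[0.1, 60.0], [0.9, 75.0], [0.4, 68.0]])
M = minmax_vectors(V)
W = norm_rescale(V)
```

Which one is right depends on whether the components of each embedding are individually meaningful (component-wise) or only the vector as a whole is (norm-based).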
Category: Data Science

Data normalization in nonstationary data classification with Learn++.NSE based on MLP

I need to predict the technical condition of an aggregate using vibration-monitoring data. We consider this data nonstationary, i.e. the distribution parameters and descriptive statistics are not constant. I found that one of the best algorithms for such tasks is Learn++.NSE, and we use it with an MLP as the base classifier. As far as I know, it is necessary to normalize data for ANN operations. We decided to normalize using the mean, the standard deviation, and a sigmoidal function. We train the networks of the ensemble with sets with …
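A sketch of what "normalize using mean, stdev and a sigmoidal function" plausibly means (this is an assumption about the described pipeline, not code from it): a z-score squashed through a logistic sigmoid, sometimes called softmax scaling.

```python
import numpy as np

def sigmoidal_normalize(x, mean, std):
    # z-score squashed through a sigmoid: maps any real value into (0, 1)
    # while preserving the ordering of the inputs
    return 1.0 / (1.0 + np.exp(-(x - mean) / std))

# in a nonstationary stream, mean/std would be taken from the current
# training chunk rather than from the whole (drifting) history
chunk = np.array([4.1, 5.0, 5.9, 7.3, 2.8])
out = sigmoidal_normalize(chunk, chunk.mean(), chunk.std())
```

For drifting data the open question is always which window the mean and std are estimated on; Learn++.NSE's chunk-wise training suggests per-chunk statistics.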
Category: Data Science

Do I need to encode numerical variables like "year"?

I have a simple time-series dataset with a date-time feature column:

user,amount,date,job
chris, 9500, 05/19/2022, clean
chris, 14600, 05/12/2021, clean
chris, 67900, 03/27/2021, cooking
chris, 495900, 04/25/2021, fixing

Using Pandas, I split this column into multiple features such as year, month, and day:

## Convert date column into datetime type
data["date"] = pd.to_datetime(data["date"], errors="coerce")
## Order by user and date
data = data.sort_values(by=["user", "date"])
## Split date into year, month, day
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
…
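One common answer, sketched on a two-row stand-in frame: year is ordinal, so it can stay numeric (optionally shifted so it starts at 0) rather than one-hot encoded, while cyclical components such as month are often given a sin/cos encoding so that December sits next to January.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["05/19/2022", "03/27/2021"])})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# year is ordinal: keep it numeric, optionally anchored at the earliest year
df["year_c"] = df["year"] - df["year"].min()

# month is cyclical: sin/cos place month 12 adjacent to month 1
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
```

Tree-based models usually handle the raw integer year/month fine; the cyclical encoding mainly helps linear models and neural networks.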
Category: Data Science

Proper iteration over time series data for LSTM neural network

I’m using supervised learning with an LSTM network to predict forex prices. To achieve this I’m using the deeplearning4j library, but I have doubts about several points of my implementation. I turned off the mini-batch feature, then created many trading indicators from the forex data. The idea is to provide random chunks of data to the neural network on every epoch and to ensure that after every epoch the network state is cleared. To achieve this I created a dataset iterator …
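The iteration pattern described can be sketched language-agnostically; this Python version (the actual implementation would be Java/DL4J, where clearing state is `rnnClearPreviousState()` on the network) shows the key invariant: each chunk is a contiguous window, only the window order is shuffled.

```python
import numpy as np

def iterate_chunks(series, window, rng):
    """Yield contiguous windows of `series` in random order; the caller is
    responsible for clearing the LSTM state after each chunk."""
    starts = np.arange(len(series) - window)
    rng.shuffle(starts)                       # randomize chunk order per epoch
    for s in starts:
        # inputs: a window of values; target: the next value after the window
        yield series[s : s + window], series[s + window]

rng = np.random.default_rng(3)
series = rng.normal(size=50)                  # stand-in for indicator values
pairs = list(iterate_chunks(series, window=10, rng=rng))
```

Shuffling chunk order is safe precisely because the state is reset between chunks; shuffling individual time steps inside a chunk would destroy the sequence the LSTM is meant to learn.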
Category: Data Science

Standardization in combination with scaling

Would it be OK to standardize all the features that exhibit a normal distribution (with StandardScaler) and then re-scale all the features to the range 0-1 (with MinMaxScaler)? So far I've only seen people do one OR the other, but not both in combination. Why is that? Also, is the Shapiro-Wilk test a good way to decide whether standardization is advisable? Should all features exhibit a normal distribution, or are you allowed to transform only the ones that do?
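One reason the combination is rarely seen, demonstrated with the scalers re-implemented in plain NumPy: both are affine maps, and min-max pins the output endpoints to 0 and 1 regardless of the input's location and scale, so standardizing first and then min-max scaling yields exactly what min-max scaling alone would.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 1))

# StandardScaler step: zero mean, unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)
# MinMaxScaler step applied afterwards: affine map into [0, 1]
M = (Z - Z.min(axis=0)) / (Z.max(axis=0) - Z.min(axis=0))

# the composition of two affine maps is one affine map, so this equals
# applying MinMaxScaler directly to the raw column
M_direct = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

So chaining them per column is harmless but redundant; the choice between the two is really a choice about which properties (fixed range vs. unit variance) the downstream model needs.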
Category: Data Science

Would I be able to combine features on a different unit scale after normalizing?

I'd like to explore some interactions between my variables, but they're on different measurement scales. Would, for example, the absolute value of their difference after scaling make sense? From what I understand, scaling them to a 0-1 range depends heavily on their max and min values, so it seems to me that interactions between them would not make sense, since each value's position on its own scale depends heavily on the observations.
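The concern can be made concrete with a toy demonstration (synthetic numbers, not the asker's data): adding a single extreme observation to one variable changes every min-max-scaled value of that variable, and with it every interaction term built from the scaled columns.

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 40.0, 20.0, 30.0])
# candidate interaction: absolute difference of the scaled variables
d1 = np.abs(minmax(a) - minmax(b))

# one outlier appended to b compresses all of b's scaled values ...
a2 = np.append(a, 4.0)
b2 = np.append(b, 400.0)
# ... so the interaction values for the SAME four observations change
d2 = np.abs(minmax(a2) - minmax(b2))[:4]
```

This sensitivity is why robust alternatives (rank/percentile transforms, or standardization with median and IQR) are often preferred before forming interactions.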
Category: Data Science

Normalizing data from same variable but different individuals

I'm new to machine learning. I have the following scenario: I have five individuals, each carrying an accelerometer. That sensor measures movement/acceleration on a scale from 0 to 255 (0 being no movement, 255 being max movement) at 5-minute intervals. Some individuals carry sensors that are more sensitive, and some less sensitive. As such, some individuals' sensors will report higher values, and some lower values, for the same movements. Using a …
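One standard approach, sketched with pandas on made-up readings: rescale within each individual, so that each person's own range (or mean/std) becomes the reference and differences in sensor sensitivity cancel out.

```python
import pandas as pd

df = pd.DataFrame({
    "individual": ["a", "a", "a", "b", "b", "b"],
    # individual b's sensor is less sensitive, so its raw values run lower
    "accel":      [10, 120, 250, 5, 60, 130],
})

# min-max scale within each individual: 0 = that person's quietest reading,
# 1 = that person's most active reading
g = df.groupby("individual")["accel"]
df["accel_scaled"] = (df["accel"] - g.transform("min")) / (
    g.transform("max") - g.transform("min")
)
```

Per-individual z-scoring (`(x - group mean) / group std`) is the usual alternative when the extremes themselves are unreliable.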
Category: Data Science

Correcting for one of multiple strong batch effects in a dataset

I am wondering which statistical tools to use when analysing data that have multiple strong batch effects (distributions vary from one batch to another). I would like to correct the batch effect originating from one variable without removing the potential batch effects of other variables. If this is unclear, a short example is probably the best way to explain my problem: imagine that we have 10 people taking part in an experiment. The experiment is …
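The simplest version of "correct one batch variable, leave the others alone" is centering within that variable only; the sketch below (made-up persons/sessions, standing in for the question's experiment) removes a per-person offset while a per-session effect survives intact in the residuals.

```python
import pandas as pd

df = pd.DataFrame({
    "person":  ["p1", "p1", "p2", "p2"],
    "session": ["s1", "s2", "s1", "s2"],
    "value":   [10.0, 12.0, 30.0, 34.0],
})

# center within the batch variable we want to remove (person); any session
# effect is untouched because we never group by session
df["value_corr"] = df["value"] - df.groupby("person")["value"].transform("mean")
```

For anything beyond shifts in the mean (different variances, confounded batches), regression-based tools such as linear mixed models, or ComBat from the bioinformatics literature, are the usual next step.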
Category: Data Science

Is it better to use a MinMax or a Log Return normalization to predict stock price movements?

I am trying to use an LSTM model to predict the d+2 and d+3 closing prices. I am not sure whether I should normalize the data with a MinMax scaler (-1, +1) or use the log return $(P(n)-P(0))/P(0)$ for each sample. I have tried quite a lot of source code from GitHub and they don't seem to converge on any one technique.
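The two options are not mutually exclusive; a common pattern (sketched on a toy price series, not a recommendation of either as "correct") is to convert prices to log returns first, because returns are closer to stationary than raw prices, and then MinMax-scale the returns into (-1, +1) for the network.

```python
import numpy as np

prices = np.array([100.0, 101.0, 99.5, 102.0, 103.5])

# log returns: roughly stationary, unlike price levels, so the model is not
# asked to generalize to price ranges it never saw in training
logret = np.diff(np.log(prices))

# then MinMax into [-1, 1]; in practice lo/hi must come from training data only
lo, hi = logret.min(), logret.max()
scaled = 2 * (logret - lo) / (hi - lo) - 1
```

Predictions come out as scaled returns and must be inverted (unscale, then compound onto the last known price) to recover d+2 / d+3 closing prices.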
Category: Data Science

Normalize data from different groups

I have data that has been grouped into 27 groups by different criteria. The reason for these groupings is to show that each group has different behavior. However, I would like to normalize everything to the same scale. For example, I would like to normalize to a 0-1 or 0-100 scale; that way I could say something like "$43^{rd}$ percentile" and it would have the same meaning across groups. If I were to just, say, standardize each group individually by subtracting …
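Since the stated goal is a percentile interpretation, the within-group percentile rank is the direct tool; sketched here with pandas on two made-up groups whose raw scales differ by orders of magnitude.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["g1", "g1", "g1", "g2", "g2"],
    "score": [10, 50, 90, 1000, 3000],
})

# percentile rank computed within each group: 0.5 means "at the median of
# my own group", whatever that group's raw scale is; multiply by 100 for 0-100
df["pct"] = df.groupby("group")["score"].rank(pct=True)
```

Unlike per-group standardization, ranks are insensitive to each group's spread and outliers, at the cost of discarding distances between values.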
Category: Data Science

How to deal with data having 0 values in many columns?

I am trying to implement logistic regression, but my dataset has many columns with skewed data, and most of their values are 0. The skewness for many of these columns goes above 190, and it is the same for the testing data as for the training data. I tried using the log method to remove skewness, but because most of the values are 0 it messed up my data. I don't know how to …
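The immediate fix for "log breaks on zeros" is `log1p`, i.e. log(1 + x): it maps 0 to 0 exactly while still compressing the long right tail that drives the skewness. A minimal sketch on a zero-heavy toy column:

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 3.0, 500.0, 0.0])

# np.log(x) would give -inf at the zeros; log1p(x) = log(1 + x) is defined
# everywhere on x >= 0 and keeps zeros at zero
y = np.log1p(x)
```

If most values are exactly 0, the deeper issue may be a zero-inflated feature; a common complement is to add a binary "is nonzero" indicator column alongside the `log1p`-transformed values.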
Category: Data Science

sklearn MinMaxScaler: Inverse does not equal original

I am using MinMaxScaler on a large dataset (2201887, 3) to normalize features, but the inverted values do not match the originals. I tested with the target column: first (a), I applied the scaler to 10 values, then did the inverse transformation, and I was able to recover the original values. Then (b), I inverted 10 normalized values after applying MinMaxScaler to the whole column, and the results were completely different. Result of (a): … Result of (b): … How can I have the …
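A likely culprit (an assumption, since the question's code is not shown) is a float32 cast somewhere in the pipeline: on large-magnitude values, float32 carries only ~7 significant digits, so the transform/inverse round trip cannot recover the originals. The sketch reproduces this with a hand-rolled MinMax transform:

```python
import numpy as np

x64 = np.linspace(1e6, 2e6, 7)            # large-magnitude feature values
lo, hi = x64.min(), x64.max()

def roundtrip(x):
    scaled = (x - lo) / (hi - lo)          # MinMax transform
    return scaled * (hi - lo) + lo         # its inverse

err64 = np.abs(roundtrip(x64) - x64).max()
# casting to float32 first (a common memory saving on ~2M-row data) already
# loses precision before the scaler ever runs
err32 = np.abs(roundtrip(x64.astype(np.float32)) - x64).max()
```

Keeping the column in float64 through `fit`/`transform`/`inverse_transform`, and comparing with a tolerance (`np.allclose`) rather than exact equality, usually resolves the discrepancy.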
Category: Data Science

How to normalize test data according to the training data if the normalization on the training data is performed row wise?

I have read in several places about the normalization of features in machine learning, but I normalize my training data row-wise, as shown in the following code (only two training samples are shown). My question is: when normalizing the test data, should I use the minimum and maximum values of each test sample to normalize that sample, or should I use the minimum and maximum values from the training data? As an explanation …
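Row-wise scaling changes the usual answer: each sample is normalized by its own statistics, so there are no training-set parameters to carry over, and the test row is scaled from its own min and max. A minimal sketch (toy arrays, not the asker's data):

```python
import numpy as np

def rowwise_minmax(X):
    # each ROW is scaled by its own min/max, unlike the usual column-wise
    # scaling where parameters are fitted on training data and reused
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo)

train = np.array([[1.0, 5.0, 9.0], [10.0, 20.0, 40.0]])
test = np.array([[2.0, 3.0, 8.0]])

train_s = rowwise_minmax(train)
test_s = rowwise_minmax(test)   # self-contained: no leakage is possible
```

The caveat is that row-wise scaling discards each sample's absolute level; if that level carries signal, feature-wise (column-wise) scaling with training-set parameters is the better choice.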
Category: Data Science

How to save pixels after normalization

I want to normalize my images and use them in training, but I couldn't find a way to save the images after making the changes below. How can I save them?

files = ["/content/drive/MyDrive/Colab Notebooks/images/evre1/xyz.png",
         "/content/drive/MyDrive/Colab Notebooks/images/evre1/xty.png"]

def normalize(files):
    for i in files:
        image = Image.open(i)
        new_image = image.resize((224, 224))
        pixels = asarray(new_image)
        # convert from integers to floats
        pixels = pixels.astype('float32')
        # calculate global mean and standard deviation
        mean, std = pixels.mean(), pixels.std()
        # print('Mean: %.3f, Standard Deviation: %.3f' % (mean, std)) # …
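The catch is that standardized pixels are floats (often negative), which PNG cannot store; one sketch of a fix, using a random array in place of the question's Drive files: rescale back into 0-255 uint8 just for saving, or keep the exact float values with `np.save` and load them at training time.

```python
import numpy as np
from PIL import Image

def to_uint8(pixels):
    # standardized pixels are floats (possibly negative); PNG needs uint8,
    # so rescale into 0-255 purely for the purpose of saving
    lo, hi = pixels.min(), pixels.max()
    return ((pixels - lo) / (hi - lo) * 255).astype("uint8")

# stand-in for a resized, standardized image (the real files are on Drive)
pixels = np.random.rand(8, 8, 3).astype("float32")
pixels = (pixels - pixels.mean()) / pixels.std()     # the normalization step

Image.fromarray(to_uint8(pixels)).save("normalized.png")
# lossless alternative that keeps the float values: np.save("img.npy", pixels)
```

Note that the uint8 round trip discards the standardization, so if training needs the zero-mean floats, save `.npy` arrays and do the PNG export only for visual inspection.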
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.