Is there a way to measure the relatedness between a dataset and the data obtained after some transformation is applied to it? That is, given a dataset, I need to find the most related data (where most of the values can be recovered) that can be produced by applying some transformation to the original data. I tried but couldn't find a proper answer; most of the discussion I found is about linear or log transformations, but I want to find a way …
I am doing linear regression using the Boston Housing data set, and applying $\log(y)$ has a huge impact on the MSE: without the transformation the MSE is 34.94, while with $y$ log-transformed it is 0.05.
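A minimal sketch of the comparison being described, assuming the Boston features and target are already loaded as arrays X and y (the split and model choice here are illustrative, not the asker's code). The key point it shows is that an MSE computed on the log scale is not directly comparable to one computed on the original scale:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the raw target: MSE is in (units of y) squared.
raw_mse = mean_squared_error(
    y_test, LinearRegression().fit(X_train, y_train).predict(X_test))

# Fit on log(y): this MSE lives on the log scale and is not comparable to raw_mse.
log_model = LinearRegression().fit(X_train, np.log(y_train))
log_scale_mse = mean_squared_error(np.log(y_test), log_model.predict(X_test))

# For a fair comparison, back-transform the predictions to the original scale first.
back_transformed_mse = mean_squared_error(
    y_test, np.exp(log_model.predict(X_test)))
```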
I have used np.log(data) and then applied data.diff() to transform my data for a time series model. I have the predictions. How do I convert them back to the original scale? Here is an example for your reference:

| sales | np.log(sales) | (np.log(sales)).diff() | predictions |
|---|---|---|---|
| 166.594019 | 5.115560 | -0.045918 | -0.045918 |

Note: I have provided only one example, from index 2, as the first value after data.diff() will be null, and hence the prediction at …
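A minimal sketch of the usual way to invert this pipeline, assuming sales is the original pandas Series and predictions is a Series of forecasts of the differenced log series (names taken from the table above):

```python
import numpy as np

# Anchor the cumulative sum at the last observed log value, then exponentiate.
last_log = np.log(sales).iloc[-1]                # last known log(sales) before the forecast
log_forecast = last_log + predictions.cumsum()   # undo .diff()
sales_forecast = np.exp(log_forecast)            # undo np.log
```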
This is a sister post to the original closed post (here). Since the data transformation part is done after data splitting, on the TRAINING data only, I wonder: wouldn't such a transformation depend on how we subsample our data? We can get different transformation results when we pick different portions of the training data. But I personally find it hard to convince myself: shouldn't data transformation be as invariant and generalizable as possible, across different subsamplings of the dataset? Also, …
Does taking the log of the odds bring linearity between the odds of the dependent variable and the independent variables by removing skewness in the data? Is this one reason why we use the log of odds in logistic regression? If so, is a log transformation of the data values unnecessary in logistic regression?
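For reference, the "log of odds" in logistic regression is the logit link, which the model assumes is linear in the predictors; it is a transformation of the modelled probability, not of the input data:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = \Pr(Y = 1 \mid x_1, \dots, x_k).$$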
I'm a data science student and I've come across a fairly unusual dataset (to me, which explains the vague title). It's of the following form:

| STAT_1 | STAT_2 | ... | HOME | AWAY | NEXT_HOME | NEXT_AWAY | NEXT_RESULT |
|---|---|---|---|---|---|---|---|
| 15 | 11 | ... | Team A | Team B | Team C | Team D | 1 |
| 11 | 18 | ... | Team C | Team D | Team E | Team F | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10 | 11 | ... | Team W | Team X | Team Y | Team Z | 1 |

Basically, the rows …
I am kind of confused about this topic of feature engineering. I am trying to make a web app in which people can upload test data as a CSV. Now I am confused about how to do feature engineering after deploying the app, especially how to handle outliers and missing values. Suppose I want to replace all the outliers of the test data with the Q3+(1.5*IQR) value. My confusion is: should I use the Q3+(1.5*IQR) value calculated from the training dataset to change all …
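A minimal sketch of the option the question is weighing, with a hypothetical column name feature: the bound is computed once from the training data and then reused, unchanged, on whatever test CSV is uploaded:

```python
import pandas as pd

# Fit step: compute the capping bound from the TRAINING data only.
q1, q3 = train["feature"].quantile([0.25, 0.75])
upper_bound = q3 + 1.5 * (q3 - q1)          # Q3 + 1.5*IQR, fixed at training time

# Serving step: apply the stored training bound to the uploaded test data.
test["feature"] = test["feature"].clip(upper=upper_bound)
```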
I am trying to understand transformations, and this question seems to be on my mind and some other people's too. If we have a numeric variable, then in EVERY data science case, will transforming the data (log, power transforms) toward a normal distribution help the model learn better? And what about stationarity? Stationarity is a different thing from transforming data to make it normally distributed. Will transforming EVERY numeric variable to be stationary make EVERY model learn better too?
I have a categorical variable with 4 levels ('8 c', '6 c', 'NAN', 'Others') and I want to convert it to numerical form. An obvious way is to simply remove the 'c' part from the first two categories and replace 'NAN' with 0. However, I was wondering about the 'Others' level: what could be the best way to transform it? Please note that the variable represents the number of cylinders for a given car.
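A minimal sketch of one common option (the DataFrame and column names are assumptions), mapping the known levels to their numeric cylinder counts while leaving the choice for 'Others' explicit:

```python
import numpy as np
import pandas as pd

# 'Others' is left as NaN here; the right numeric value depends on what it actually contains.
mapping = {"8 c": 8, "6 c": 6, "NAN": 0, "Others": np.nan}
df["cylinders_num"] = df["cylinders"].map(mapping)
```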
I have 1 million gzipped files which contain, in total, 350 million newline-separated JSON objects: 26 GB compressed, ~320 GB uncompressed, representing 7 years of data for a multi-tenant application. I want to create one Parquet file per tenant per month. tenant_id is a property of each object. All objects have the same structure. There are ~30 properties. Property values can be missing, booleans may be quoted "true" or unquoted true, etc. Many of my attempts have failed so far, with all the tools …
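A minimal sketch of one possible pipeline, not a tested solution at this scale: stream each gzipped NDJSON file with pandas, normalise the quoted/unquoted booleans, and let pyarrow write a partitioned dataset keyed by tenant_id and month. The column names created_at and some_flag are assumptions, and each input file produces one output file per partition it touches, so a final compaction pass would still be needed to end up with exactly one Parquet file per tenant per month:

```python
import glob

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Map both quoted and unquoted boolean representations to real booleans.
BOOL_MAP = {"true": True, "false": False, True: True, False: False}

for path in glob.glob("data/*.gz"):
    df = pd.read_json(path, lines=True, compression="gzip")
    df["some_flag"] = df["some_flag"].map(BOOL_MAP)                      # assumed boolean column
    df["month"] = pd.to_datetime(df["created_at"]).dt.strftime("%Y-%m")  # assumed timestamp column
    pq.write_to_dataset(
        pa.Table.from_pandas(df),
        root_path="parquet_out",
        partition_cols=["tenant_id", "month"],
    )
```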
At the moment I'm using XGBoost with a custom objective function to generate probability predictions and build something like an expert system. To do so I need to transform the raw XGBoost predictions into a probability distribution, where every value lies in the range from 0 to 1 and they all sum to 1. Naturally you start out with the softmax transformation, but as it turns out this function has some significant drawbacks for this kind of application. …
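For reference, the transformation mentioned above is usually implemented as the numerically stabilised softmax below. The temperature parameter is a common extra knob if plain softmax turns out too peaked; it is an assumption here, not part of XGBoost or the asker's setup:

```python
import numpy as np

def softmax(raw_scores, temperature=1.0):
    """Map raw margins to values in [0, 1] that sum to 1."""
    z = np.asarray(raw_scores, dtype=float) / temperature
    z = z - z.max()              # shift for numerical stability; the result is unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```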
I am working on a regression problem where I have a lot of outliers in multiple variables. As far as I can think of, there are three things I can do with outliers:
1. Remove them (least attractive option)
2. Transform them (log transformation, Box-Cox transformation, etc.)
3. Do nothing and build a model that includes them
My question is about the second point. If I want to transform my features using any of these transformations solely for the purpose of handling outliers, is it …
I am working on a linear regression problem, and one of the assumptions of a linear regression model is that the features should be normally distributed. Hence, to make my non-normal features closer to normal, I am performing several transformations such as log, Box-Cox, and square-root. I have both discrete and continuous numerical variables (an example of each, along with their histograms and QQ plots, is given):

[Figures: continuous variable histogram and QQ plot; discrete variable histogram and QQ plot]

From …
Say you have data with fields named A, B, C, KEY, and VALUE, and let's say the KEY field contains a discrete set of possible values like "X", "Y", and "Z". How do you transform your data with Tableau so that your resulting data has fields A, B, C, X, Y, Z? Given an original set of records that have A=a, B=b, C=c: the value for X should be the VALUE from the original record containing A=a, B=b, C=c and …
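This is not a Tableau answer, just a pandas sketch to make the target shape concrete, assuming the source table is loaded as df: pivot KEY into the columns X, Y, Z keyed by (A, B, C), carrying VALUE across:

```python
import pandas as pd

# Wide result has columns A, B, C, X, Y, Z; aggfunc="first" assumes one VALUE per (A, B, C, KEY).
wide = (
    df.pivot_table(index=["A", "B", "C"], columns="KEY", values="VALUE", aggfunc="first")
      .reset_index()
)
```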
I am performing EDA on a dataset of hotel reservations. The target is categorical, stating whether a given customer will cancel the reservation or not. The dataset has 25 features and 30,244 entries. I have two features stating the number of adults and the number of babies coming with the person who made the reservation. The number of adults can be 1, 2, 3, 4, or 5 (the range is specifically given in the dataset description). The number of babies in the train set takes values 0, 1, …
I'm working on a side project where I have a mixture of static data and time series, and the goal would be to perform clustering on the data. There are a bunch of data sources, but basically the main thing would be some static information about users (like age, sex, location, etc.) and some time series data (user 123 did xyz at 2pm, then yxz at 3pm, then yyy at 4pm). The goal would be to perform a clustering/segmentation via unsupervised …
I'm looking for tools to characterize relationships between gridded outputs of multiple physical models as image distortions. For instance, given a 2-D picture of the temperature distribution in two rooms, one might characterize it by a contraction of an upper layer of warm air:

[figure]

The inverse problem I am interested in is inferring this contraction using the two fields as inputs. I understand that this may often be an underdetermined problem and am prepared to regularize as necessary by imposing, …
My end goal is to visualize some data using a violin plot or something similar in Python. I have the following data in a file (test.csv). The first column is a list of species. The other columns give the abundance of each species at a certain altitude (e.g. how abundant is species A at altitude 1000, 2000?). (Ignoring units for now.) How can I plot this as a violin plot (or something similar)?

test.csv:
species,1000,2000,3000,4000,5000,6000,7000
species_A,0.5,0.5,,,2,1,2
species_B,0.5,1,0.5,0.5,1,1,10
species_C,1,1,10,3,15,4,5
species_D,15,3,2,1,0.5,1,3

The Python …
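A minimal sketch assuming the test.csv layout above and seaborn/matplotlib: melt the table to long form so each row is (species, altitude, abundance), then draw one violin per species:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("test.csv")
# Reshape from one column per altitude to one (species, altitude, abundance) row per cell.
long_df = df.melt(id_vars="species", var_name="altitude", value_name="abundance")

sns.violinplot(data=long_df, x="species", y="abundance")
plt.show()
```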
I have a variable with a skewed distribution. I applied a Box-Cox transformation and now the variable follows a Gaussian distribution. But, as seen in the boxplot in the image below, outliers still exist. My question is: although the variable's distribution is nearly Gaussian after the transformation, if there are still outliers, should we still select this transformation? Or should we use other techniques, such as discretization, in order to capture all the outliers?
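A minimal sketch of the transformation and the boxplot check being described, assuming x is the skewed variable as a 1-D array of strictly positive values (Box-Cox requires positive inputs):

```python
import matplotlib.pyplot as plt
from scipy import stats

x_bc, lam = stats.boxcox(x)          # transformed values and the fitted lambda

# Compare outliers before and after the transformation.
fig, axes = plt.subplots(1, 2)
axes[0].boxplot(x)
axes[0].set_title("original")
axes[1].boxplot(x_bc)
axes[1].set_title("Box-Cox")
plt.show()
```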
I am aiming to assess the effect of BMI (continuous) on certain biomarkers (also continuous) whilst adjusting for several relevant variables (mixed categorical and continuous) using multiple regression. My data is non-normal, which I believe violates one of the key assumptions of multiple linear regression. Whilst I think the regression can still be performed, I think the non-normality affects significance testing, which is an issue for me. I think I can transform the data and then perform the regression, but I'm not sure …