Is there a way to measure the relatedness between a dataset and the data obtained after some transformation is applied to it? That is, given a dataset, I need to find the most related data (where most of the values can be recovered) that can be produced by applying some transformation to the original data. I tried but couldn't find a proper answer; most of the discussion I found is about linear or log transformations, but I want to find a way …
I am doing linear regression using the Boston Housing data set, and applying $\log(y)$ has a huge impact on the MSE: without the transformation the MSE is 34.94, while with $y$ log-transformed it is 0.05.
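A minimal sketch of the comparison being described, assuming the Boston features and target are already loaded as arrays X and y (the split and model choice here are illustrative, not the asker's code). The key point it shows is that an MSE computed on the log scale is not directly comparable to one computed on the original scale:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the raw target: MSE is in (units of y) squared.
raw_mse = mean_squared_error(
    y_test, LinearRegression().fit(X_train, y_train).predict(X_test))

# Fit on log(y): this MSE lives on the log scale and is not comparable to raw_mse.
log_model = LinearRegression().fit(X_train, np.log(y_train))
log_scale_mse = mean_squared_error(np.log(y_test), log_model.predict(X_test))

# For a fair comparison, back-transform the predictions to the original scale first.
back_transformed_mse = mean_squared_error(
    y_test, np.exp(log_model.predict(X_test)))
```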
I have used np.log(data) and then applied data.diff() to transform my data for a time series model. I have the predictions. How do I convert them back to the original scale? Here is an example for your reference:

| sales | np.log(sales) | (np.log(sales)).diff() | predictions |
|---|---|---|---|
| 166.594019 | 5.115560 | -0.045918 | -0.045918 |

Note: I have provided only one example, from index 2, as the first value after data.diff() will be null, and hence the prediction at …
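A minimal sketch of the usual way to invert this pipeline, assuming sales is the original pandas Series and predictions is a Series of forecasts of the differenced log series (names taken from the table above):

```python
import numpy as np

# Anchor the cumulative sum at the last observed log value, then exponentiate.
last_log = np.log(sales).iloc[-1]                # last known log(sales) before the forecast
log_forecast = last_log + predictions.cumsum()   # undo .diff()
sales_forecast = np.exp(log_forecast)            # undo np.log
```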
This is a sister post to the original closed post (here). Since the data transformation part is done after data splitting, on the TRAINING data only, I wonder: wouldn't such a transformation depend on how we subsample our data? We can get different transformation results when we pick different portions of the training data. But I personally find it hard to convince myself: shouldn't data transformation be as invariant and generalizable as possible, across different subsamplings of the dataset? Also, …
Does taking the log of the odds bring linearity between the odds of the dependent variable and the independent variables by removing skewness in the data? Is this one reason why we use the log of odds in logistic regression? If so, is a log transformation of the data values unnecessary in logistic regression?
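For reference, the "log of odds" in logistic regression is the logit link, which the model assumes is linear in the predictors; it is a transformation of the modelled probability, not of the input data:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = \Pr(Y = 1 \mid x_1, \dots, x_k).$$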
I'm a data science student and I've come across a fairly unusual dataset (to me, which explains the vague title). It's of the following form:

| STAT_1 | STAT_2 | ... | HOME | AWAY | NEXT_HOME | NEXT_AWAY | NEXT_RESULT |
|---|---|---|---|---|---|---|---|
| 15 | 11 | ... | Team A | Team B | Team C | Team D | 1 |
| 11 | 18 | ... | Team C | Team D | Team E | Team F | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10 | 11 | ... | Team W | Team X | Team Y | Team Z | 1 |

Basically, the rows …
I am kind of confused about this topic of feature engineering. I am trying to make a web app in which people can upload test data as a CSV. Now I am confused about how to do feature engineering after deploying the app, especially how to handle outliers and missing values. Suppose I want to replace all the outliers of the test data with the Q3+(1.5*IQR) value. My confusion is: should I use the Q3+(1.5*IQR) value calculated from the training dataset to change all …
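A minimal sketch of the option the question is weighing, with a hypothetical column name feature: the bound is computed once from the training data and then reused, unchanged, on whatever test CSV is uploaded:

```python
import pandas as pd

# Fit step: compute the capping bound from the TRAINING data only.
q1, q3 = train["feature"].quantile([0.25, 0.75])
upper_bound = q3 + 1.5 * (q3 - q1)          # Q3 + 1.5*IQR, fixed at training time

# Serving step: apply the stored training bound to the uploaded test data.
test["feature"] = test["feature"].clip(upper=upper_bound)
```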
I am trying to understand transformations, and this question seems to be on my mind and some other people's too. If we have a numeric variable, then in EVERY data science case, will transforming the data (log, power transforms) toward a normal distribution help the model learn better? And what about stationarity? Stationarity is a different thing from transforming data to make it normally distributed. Will transforming EVERY numeric variable to be stationary make EVERY model learn better too?
I have a categorical variable with 4 levels ('8 c', '6 c', 'NAN', 'Others') and I want to convert it to numerical form. An obvious way is to simply remove the 'c' part from the first two categories and replace 'NAN' with 0. However, I was wondering about the 'Others' level: what could be the best way to transform it? Please note that the variable represents the number of cylinders for a given car.
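A minimal sketch of one common option (the DataFrame and column names are assumptions), mapping the known levels to their numeric cylinder counts while leaving the choice for 'Others' explicit:

```python
import numpy as np
import pandas as pd

# 'Others' is left as NaN here; the right numeric value depends on what it actually contains.
mapping = {"8 c": 8, "6 c": 6, "NAN": 0, "Others": np.nan}
df["cylinders_num"] = df["cylinders"].map(mapping)
```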
I have 1 million gzipped files which contain, in total, 350 million newline-separated JSON objects: 26 GB compressed, ~320 GB uncompressed, representing 7 years of data for a multi-tenant application. I want to create one Parquet file per tenant per month. tenant_id is a property of each object. All objects have the same structure. There are ~30 properties. Property values can be missing, booleans may be quoted "true" or unquoted true, etc. Many of my attempts have failed so far, with all the tools …
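A minimal sketch of one possible pipeline, not a tested solution at this scale: stream each gzipped NDJSON file with pandas, normalise the quoted/unquoted booleans, and let pyarrow write a partitioned dataset keyed by tenant_id and month. The column names created_at and some_flag are assumptions, and each input file produces one output file per partition it touches, so a final compaction pass would still be needed to end up with exactly one Parquet file per tenant per month:

```python
import glob

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Map both quoted and unquoted boolean representations to real booleans.
BOOL_MAP = {"true": True, "false": False, True: True, False: False}

for path in glob.glob("data/*.gz"):
    df = pd.read_json(path, lines=True, compression="gzip")
    df["some_flag"] = df["some_flag"].map(BOOL_MAP)                      # assumed boolean column
    df["month"] = pd.to_datetime(df["created_at"]).dt.strftime("%Y-%m")  # assumed timestamp column
    pq.write_to_dataset(
        pa.Table.from_pandas(df),
        root_path="parquet_out",
        partition_cols=["tenant_id", "month"],
    )
```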
At the moment I'm using XGBoost with a custom objective function to generate probability predictions and build something like an expert system. To do so I need to transform the raw XGBoost predictions into a probability distribution, where every value lies in the range from 0 to 1 and they all sum to 1. Naturally you start out with the softmax transformation, but as it turns out this function has some significant drawbacks for this kind of application. …
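For reference, the transformation mentioned above is usually implemented as the numerically stabilised softmax below. The temperature parameter is a common extra knob if plain softmax turns out too peaked; it is an assumption here, not part of XGBoost or the asker's setup:

```python
import numpy as np

def softmax(raw_scores, temperature=1.0):
    """Map raw margins to values in [0, 1] that sum to 1."""
    z = np.asarray(raw_scores, dtype=float) / temperature
    z = z - z.max()              # shift for numerical stability; the result is unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```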
I am working on a regression problem where I have a lot of outliers in multiple variables. As far as I can think of, there are three things I can do with outliers:
1. Remove them (least attractive option)
2. Transform them (log transformation, Box-Cox transformation, etc.)
3. Do nothing and build a model that includes them
My question is about the second point. If I want to transform my features using any of these transformations solely for the purpose of handling outliers, is it …
I am working on a linear regression problem, and one of the assumptions of a linear regression model is that the features should be normally distributed. Hence, to make my non-normal features closer to normal, I am performing several transformations such as log, Box-Cox, and square-root. I have both discrete and continuous numerical variables (an example of each, along with their histograms and QQ plots, is given):

[Figures: continuous variable histogram and QQ plot; discrete variable histogram and QQ plot]

From …
Say you have data with fields named A, B, C, KEY, and VALUE, and let's say the KEY field contains a discrete set of possible values like "X", "Y", and "Z". How do you transform your data with Tableau so that your resulting data has fields A, B, C, X, Y, Z? Given an original set of records that have A=a, B=b, C=c: the value for X should be the VALUE from the original record containing A=a, B=b, C=c and …
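This is not a Tableau answer, just a pandas sketch to make the target shape concrete, assuming the source table is loaded as df: pivot KEY into the columns X, Y, Z keyed by (A, B, C), carrying VALUE across:

```python
import pandas as pd

# Wide result has columns A, B, C, X, Y, Z; aggfunc="first" assumes one VALUE per (A, B, C, KEY).
wide = (
    df.pivot_table(index=["A", "B", "C"], columns="KEY", values="VALUE", aggfunc="first")
      .reset_index()
)
```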
I am performing EDA on a dataset of hotel reservations. The target is categorical, stating whether a given customer will cancel the reservation or not. The dataset has 25 features and 30,244 entries. I have two features stating the number of adults and the number of babies coming with the person who made the reservation. The number of adults can be 1, 2, 3, 4, or 5 (the range is specifically given in the dataset description). The number of babies in the train set takes values 0, 1, …
I'm working on a side project where I have a mixture of static data and time series, and the goal would be to perform clustering on the data. There are a bunch of data sources, but basically the main thing would be some static information about users (like age, sex, location, etc.) and some time series data (user 123 did xyz at 2pm, then yxz at 3pm, then yyy at 4pm). The goal would be to perform a clustering/segmentation via unsupervised …
I'm looking for tools to characterize relationships between gridded outputs of multiple physical models as image distortions. For instance, given a 2-D picture of the temperature distribution in two rooms, one might characterize it by a contraction of an upper layer of warm air:

[figure]

The inverse problem I am interested in is inferring this contraction using the two fields as inputs. I understand that this may often be an underdetermined problem and am prepared to regularize as necessary by imposing, …
My end goal is to visualize some data using a violin plot or something similar in Python. I have the following data in a file (test.csv). The first column is a list of species. The other columns give the abundance of each species at a certain altitude (e.g. how abundant is species A at altitude 1000, 2000?). (Ignoring units for now.) How can I plot this as a violin plot (or something similar)?

test.csv:
species,1000,2000,3000,4000,5000,6000,7000
species_A,0.5,0.5,,,2,1,2
species_B,0.5,1,0.5,0.5,1,1,10
species_C,1,1,10,3,15,4,5
species_D,15,3,2,1,0.5,1,3

The Python …
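A minimal sketch assuming the test.csv layout above and seaborn/matplotlib: melt the table to long form so each row is (species, altitude, abundance), then draw one violin per species:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("test.csv")
# Reshape from one column per altitude to one (species, altitude, abundance) row per cell.
long_df = df.melt(id_vars="species", var_name="altitude", value_name="abundance")

sns.violinplot(data=long_df, x="species", y="abundance")
plt.show()
```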
I have a variable with a skewed distribution. I applied a Box-Cox transformation and now the variable follows a Gaussian distribution. But, as seen in the boxplot in the image below, outliers still exist. My question is: although the variable's distribution is nearly Gaussian after the transformation, if there are still outliers, should we still select this transformation? Or should we use other techniques, such as discretization, in order to capture all the outliers?
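A minimal sketch of the transformation and the boxplot check being described, assuming x is the skewed variable as a 1-D array of strictly positive values (Box-Cox requires positive inputs):

```python
import matplotlib.pyplot as plt
from scipy import stats

x_bc, lam = stats.boxcox(x)          # transformed values and the fitted lambda

# Compare outliers before and after the transformation.
fig, axes = plt.subplots(1, 2)
axes[0].boxplot(x)
axes[0].set_title("original")
axes[1].boxplot(x_bc)
axes[1].set_title("Box-Cox")
plt.show()
```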
I am aiming to assess the effect of BMI (continuous) on certain biomarkers (also continuous) whilst adjusting for several relevant variables (mixed categorical and continuous) using multiple regression. My data is non-normal, which I believe violates one of the key assumptions of multiple linear regression. Whilst I think the regression can still be performed, I think the non-normality affects significance testing, which is an issue for me. I think I can transform the data and then perform the regression, but I'm not sure …