Finding data with transformation applied

Is there a way to measure the relatedness between a dataset and the data obtained by applying some transformation to it? That is, given a dataset, I need to find the most related data (data in which most of the values can be obtained) by applying some transformation to the original data. I searched but couldn't find a proper answer; most of the discussions I found are about linear or log transformations, but I want to find a way …
Category: Data Science

How to revert np.log(data) and data.diff()?

I used np.log(data) and then data.diff() to transform my data for a time series model, and I now have the predictions. How do I convert them back to the original scale? Here is an example for reference:

--------------------------------------------------------------------
| sales      | np.log(sales) | (np.log(sales)).diff() | predictions |
--------------------------------------------------------------------
| 166.594019 | 5.115560      | -0.045918              | -0.045918   |
--------------------------------------------------------------------

Note: I have provided only one example, taken from index 2, because the first value after data.diff() will be null. Hence the prediction at …
Category: Data Science
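
For the log-and-difference question above, a minimal sketch of undoing both steps, assuming the first observed value is still available as an anchor. The sales values and the "perfect" predictions are hypothetical and only serve to show that the round trip recovers the original series.

```python
import numpy as np
import pandas as pd

# Hypothetical sales series; the values are illustrative only.
sales = pd.Series([180.0, 166.594019, 170.0, 175.0])

log_sales = np.log(sales)
log_diff = log_sales.diff()  # the first element is NaN

# Assumption: pretend the model reproduced the differenced series exactly.
predictions = log_diff.copy()

# Undo diff(): cumulative sum of the predicted differences, anchored at the
# first observed log value. The NaN at position 0 is treated as "no change".
restored_log = log_sales.iloc[0] + predictions.fillna(0).cumsum()

# Undo np.log(): exponentiate.
restored = np.exp(restored_log)
print(restored.round(6).tolist())
```

For a single one-step-ahead forecast the same idea reduces to multiplying the last known value by `np.exp(predicted_difference)`.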

Feature engineering before splitting

This is a sister post to the original closed post (here). Since the data transformation is done after data splitting, on the TRAINING data only, wouldn't such a transformation depend on how we subsample our data? We can get different transformation results when we pick a different portion of the data for training. But I find it hard to convince myself: shouldn't a data transformation be as invariant and generalizable as possible across different subsamplings of the dataset? Also, …
Category: Data Science
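
To make the concern in the question above concrete, here is a small sketch, assuming a simple standardization (subtract the mean, divide by the standard deviation) as the transformation. The data, split sizes, and helper names are hypothetical. It fits the transformation on two different training subsamples and shows that the fitted parameters differ, though only slightly when the subsamples are large.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=1000)  # hypothetical feature

def fit_standardizer(train):
    """Learn the transformation parameters from training data only."""
    return train.mean(), train.std()

def apply_standardizer(values, params):
    """Apply previously learned parameters to any data."""
    mean, std = params
    return (values - mean) / std

idx = rng.permutation(data.size)
split_a = data[idx[:800]]   # one possible training subsample
split_b = data[idx[200:]]   # a different, overlapping subsample

params_a = fit_standardizer(split_a)
params_b = fit_standardizer(split_b)
print(params_a, params_b)   # close to each other, but not identical

# Held-out rows are transformed with parameters they did not influence.
held_out = data[idx[800:]]
scaled = apply_standardizer(held_out, params_a)
```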

How to predict an outcome of the game (next row) based on all previous games (rows)?

I'm a data science student and I've come across a fairly unusual dataset (to me, which explains the vague title). It's of the following form:

STAT_1  STAT_2  ...  HOME    AWAY    NEXT_HOME  NEXT_AWAY  NEXT_RESULT
15      11      ...  Team A  Team B  Team C     Team D     1
11      18      ...  Team C  Team D  Team E     Team F     0
...     ...     ...  ...     ...     ...        ...        ...
10      11      ...  Team W  Team X  Team Y     Team Z     1

Basically, the rows …
Category: Data Science

How to do feature engineering after getting test data in deployment?

I am somewhat confused about feature engineering. I am trying to build a web app in which people can upload test data as a CSV file. I am unsure how to do feature engineering after the app is deployed, especially how to handle outliers and missing values. Suppose I want to replace all the outliers in the test data with the Q3+(1.5*IQR) value. My confusion is: should I use the Q3+(1.5*IQR) value calculated on the training dataset to change all …
Category: Data Science
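
A minimal sketch of the option the question above describes, in which the Q3+(1.5*IQR) threshold is computed once on the training data and then reused for uploaded data. The function names and the numbers are hypothetical, and this illustrates one possible convention rather than a recommendation for every case.

```python
import numpy as np

def fit_upper_cap(train_values):
    """Compute Q3 + 1.5 * IQR from the training data only."""
    q1, q3 = np.percentile(train_values, [25, 75])
    return q3 + 1.5 * (q3 - q1)

def cap_outliers(values, upper_cap):
    """Replace every value above the stored threshold with the threshold."""
    return np.minimum(values, upper_cap)

# Hypothetical training column; the cap would be saved alongside the model.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
upper_cap = fit_upper_cap(train)

# Hypothetical uploaded data, capped with the training threshold.
uploaded = np.array([2.0, 50.0, 7.5])
capped = cap_outliers(uploaded, upper_cap)
print(upper_cap, capped)
```

Storing `upper_cap` with the model means a single uploaded row can be processed without needing enough rows to estimate quartiles on its own.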

Should I always transform data to a normal distribution?

I am trying to understand transformations, and this question seems to be on my mind and other people's too. In EVERY data science case where we have a numeric variable, will transforming the data (log, power transforms) to a normal distribution help the model learn better? And what about stationarity? Stationarity is a different thing from transforming data to have a normal distribution. Will making EVERY numeric variable stationary make EVERY model learn better too?
Category: Data Science

Transforming Categorical to Numerical variable

I have a categorical variable with 4 levels ('8 c', '6 c', 'NAN', 'Others') and I want to convert it to numerical form. An obvious way is to simply remove the 'c' part from the first two categories and replace NAN with 0. However, I was wondering about the 'Others' level. What would be the best way to transform this level? Please note that the variable represents the number of cylinders for a given car.
Category: Data Science

320GB `YYYY/MM/DD/HH/*.json.gz` -> `YYYY/MM/tenant_id=x/data.parquet`?

I have 1 million gzipped files which contain in total 350 million \n-separated JSON objects. 26GB compressed, ~320GB uncompressed, representing 7 years of data for a multi-tenant application. I want to create one parquet file per tenant per month. tenant_id is a property of each object. All objects have the same structure. There are ~30 properties. Property values can be missing, booleans may be quoted "true" or unquoted true, etc. Many of my attempts have failed so far, with all the tools …
Category: Data Science

Is there a Softmax-like transformation with scale-invariance and linearity?

At the moment I'm using XGBoost to generate a prediction of probabilities with a custom objective-function to build something like an expert system. To do so I need to transform the raw XGBoost predictions into a probability distribution, where every value lies in the range from 0 to 1 and they all sum up to 1. Naturally you start out with the Softmax transformation. But as it turns out this function has some significant drawbacks for this kind of application. …
Category: Data Science

Outlier treatment

I am working on a regression problem where I have a lot of outliers in multiple variables. As far as I can think of, there are 3 things I can do with outliers:

1. Remove them (least attractive option)
2. Transform them (log transformation, Box-Cox transformation, etc.)
3. Do nothing and build a model including them

My question is regarding the second point. If I want to transform my features using any of the transformations solely for the purpose of handling outliers, is it …
Category: Data Science

Should one log transform discrete numerical variables?

I am working on a Linear Regression problem and one of the assumptions of a Linear Regression model is that the features should be Normally Distributed. Hence, to convert my non-linear features to linear ones, I am applying several transformations such as log, Box-Cox, and square root. I have both discrete and continuous numerical variables (an example of each, along with its histogram and QQ plot, is given):

[Figure: continuous variable, histogram and QQ plot]
[Figure: discrete variable, histogram and QQ plot]

From …
Category: Data Science

Un-Pivot Data in Tableau

Say you have data with fields named: A, B, C, KEY, VALUE. And let's say the KEY field contains a discrete set of possible values like "X", "Y", and "Z". How do you transform your data with Tableau so that your resulting data has fields: A, B, C, X, Y, Z? Given an original set of records that have A=a, B=b, C=c: the value for X should be the VALUE from the original record containing A=a, B=b, C=c and …
Category: Data Science

Should I apply a transformation to columns with INTEGERS, in case I want to reduce the skewness of that column?

I am performing EDA on a dataset of Hotel Reservations. The target is categorical, stating whether a given customer will cancel the reservation or not. The dataset has 25 features and 30244 entries. I have two features stating the number of adults and the number of babies coming with the person who made the reservation. The number of adults can be 1, 2, 3, 4, or 5. (The range is specifically given in the dataset description.) The number of babies in the train set takes values 0, 1, …
Category: Data Science

Transforming time series into static features?

I'm working on a side project where I have a mixture of static data and time series, and the goal would be to perform clustering on the data. There's a bunch of data sources, but basically the main thing would be some static information about users (like age, sex, location etc.) and some time series data (user 123 did xyz at 2pm, then yxz at 3pm, then yyy at 4pm). The goal would be to perform a clustering/segmentation via unsupervised …
Category: Data Science

Algorithm for learning image distortion?

I'm looking for tools to characterize relationships between gridded outputs of multiple physical models as image distortions. For instance, given a 2-d picture of the temperature distribution in two rooms, one might characterize it by a contraction of an upper layer of warm air: The inverse problem I am interested in is inferring this contraction using the two fields as inputs. I understand that this may often be an underdetermined problem and am prepared to regularize as necessary by imposing, …
Category: Data Science

How to reshape or clean data to be able to visualize it with violin plots?

My end goal is to visualize some data using a violin plot or something similar using Python. I have the following data in a file (test.csv). The first column is a list of species. The other columns give the abundance of the species at a certain altitude (e.g. how abundant is species A at altitude 1000, 2000?). (Ignoring units for now.) How can I plot this as a violin plot (or something similar)?

test.csv
species,1000,2000,3000,4000,5000,6000,7000
species_A,0.5,0.5,,,2,1,2
species_B,0.5,1,0.5,0.5,1,1,10
species_C,1,1,10,3,15,4,5
species_D,15,3,2,1,0.5,1,3

The Python …
Category: Data Science
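
For the reshaping part of the question above, a minimal sketch using pandas `melt` to turn the wide table into one row per species and column, which is the long layout that plotting libraries generally work with. The CSV content is copied from the question; the column names `altitude` and `abundance` are assumptions chosen for illustration, and dropping the empty cells is one possible way to handle them.

```python
import io
import pandas as pd

# Contents of test.csv as given in the question.
csv_text = """species,1000,2000,3000,4000,5000,6000,7000
species_A,0.5,0.5,,,2,1,2
species_B,0.5,1,0.5,0.5,1,1,10
species_C,1,1,10,3,15,4,5
species_D,15,3,2,1,0.5,1,3
"""

wide = pd.read_csv(io.StringIO(csv_text))

# Wide -> long: one row per (species, altitude) pair.
long = wide.melt(id_vars="species", var_name="altitude", value_name="abundance")

# Assumption: empty cells carry no information, so they are dropped.
long = long.dropna(subset=["abundance"])

# Column headers are read as strings; convert them to integers.
long["altitude"] = long["altitude"].astype(int)

print(long.head())
```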

What if outliers still exist after variable transformation?

I have a variable with a skewed distribution. I applied BoxCox transformation and now the variable follows a Gaussian distribution. But, as seen in the image below in the boxplot, outliers still exist. My question is: Although after transformation, the variable distribution is nearly Gaussian, if there are still outliers, should we still select this transformation? Or should we decide to use other techniques such as discretization in order to capture all outliers?
Category: Data Science

Multiple regression with non-normal data in Minitab - help

I am aiming to assess the effect of BMI (continuous) on certain biomarkers (also continuous) whilst adjusting for several relevant variables (mixed categorical and continuous) using multiple regression. My data is non-normal which I believe violates one of the key assumptions of multiple linear regression. Whilst I think it can still be performed I think it affects significance testing which is an issue for me. I think I can transform the data and then perform regression but I'm not sure …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.