Linear Regression bad results after log transformation

I have a dataset that has the following columns: The variable I'm trying to predict is "rent". My dataset looks very similar to the one in this notebook. I tried to normalize the rent column and the area column using a log transformation, since both columns were positively skewed. Here are the distributions of the rent and area columns before and after the log transformation. [Before and after plots not shown.] I thought after these changes my regression models would improve and in fact they …
Category: Data Science
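
For the log-transformation question above, one thing worth checking is whether errors are being compared on the same scale. The following is a minimal sketch, assuming a single area feature, made-up data, and a hypothetical helper `fit_line` (none of which come from the question). It fits ordinary least squares on the logged columns and converts the predictions back to rent units before scoring.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# Illustrative data only (an assumption): rent = 2 * area ** 0.5.
area = [20.0, 35.0, 50.0, 80.0, 120.0, 200.0]
rent = [2.0 * a ** 0.5 for a in area]

# Fit in log space: log(rent) = intercept + slope * log(area).
log_intercept, log_slope = fit_line(
    [math.log(a) for a in area],
    [math.log(r) for r in rent],
)

# Predictions come out in log units; exponentiate before computing errors
# so that the metric is expressed in rent units, not log-rent units.
pred_rent = [math.exp(log_intercept + log_slope * math.log(a)) for a in area]
mae_rent_units = sum(abs(p - r) for p, r in zip(pred_rent, rent)) / len(rent)
```

An error computed on log(rent) and an error computed on rent are not directly comparable, so a model can look better or worse purely because of the scale on which it is scored.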

Dealing with diverse groups in regression

What happens if a certain dataset contains different "groups" that follow different linear models? For example, let's imagine that, examining the scatterplot of a certain feature $x_i$ against $y$, we can see that some points follow a linear relationship with a coefficient $\beta_A<0$ while other points clearly have $\beta_B>0$. We can infer that these points belong to two different populations: population $A$ responds negatively to high values of feature $x_i$, while population $B$ responds positively. We then create a categorical …
Category: Data Science
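
The situation described in the question above can be illustrated with a small sketch. It assumes made-up data with a known group label and a hypothetical helper `fit_per_group`. Fitting one line per group gives the same coefficients as a single model that includes the group indicator and its interaction with the feature, whereas a pooled fit that ignores the group blurs the two slopes together.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def fit_per_group(rows):
    """Fit a separate line for each group label found in (x, y, group) rows."""
    groups = {}
    for x, y, label in rows:
        xs, ys = groups.setdefault(label, ([], []))
        xs.append(x)
        ys.append(y)
    return {label: fit_line(xs, ys) for label, (xs, ys) in groups.items()}

# Illustrative data (an assumption): group A falls with x, group B rises.
rows = [
    (1.0, 9.0, "A"), (2.0, 7.0, "A"), (3.0, 5.0, "A"), (4.0, 3.0, "A"),
    (1.0, 2.0, "B"), (2.0, 3.5, "B"), (3.0, 5.0, "B"), (4.0, 6.5, "B"),
]

per_group = fit_per_group(rows)   # {label: (intercept, slope)}
pooled = fit_line([r[0] for r in rows], [r[1] for r in rows])
```

Here the per-group slopes have opposite signs while the pooled slope is close to zero, which is the pattern the question describes.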

Why not use linear regression to fine-tune the last layer of a neural network?

In transfer learning, often only the last layer of the network is retrained using gradient descent. However, the last layer of a common neural network performs only a linear transformation, so why do we use gradient descent and not linear (or logistic) regression to fine-tune the last layer?
Category: Data Science
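
The premise of the question above can be checked on a toy case. This sketch assumes a squared-error loss, a single frozen feature from the earlier layers, and made-up numbers; the function names are hypothetical. It does not explain why gradient descent is the usual choice in practice; it only illustrates that, under these assumptions, the closed-form least-squares solution and gradient descent arrive at the same last-layer weights. (With a cross-entropy loss there is no closed form, and logistic regression is itself fitted iteratively.)

```python
def closed_form(features, targets):
    """Least-squares bias and weight for targets ~ bias + weight * feature."""
    n = len(features)
    f_mean = sum(features) / n
    t_mean = sum(targets) / n
    sft = sum((f - f_mean) * (t - t_mean) for f, t in zip(features, targets))
    sff = sum((f - f_mean) ** 2 for f in features)
    weight = sft / sff
    return t_mean - weight * f_mean, weight

def gradient_descent(features, targets, lr=0.05, steps=5000):
    """Minimise mean squared error over bias and weight with plain gradient steps."""
    weight, bias = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        errors = [weight * f + bias - t for f, t in zip(features, targets)]
        grad_w = sum(2 * e * f for e, f in zip(errors, features)) / n
        grad_b = sum(2 * e for e in errors) / n
        weight -= lr * grad_w
        bias -= lr * grad_b
    return bias, weight

# Illustrative frozen-layer outputs and targets (assumptions, not real data).
features = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 2.9, 5.1, 7.0]

exact = closed_form(features, targets)        # (bias, weight)
approx = gradient_descent(features, targets)  # (bias, weight)
```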

How to combine nlp and numeric data for a linear regression problem

I'm very new to data science (this is my hello world project), and I have a data set made up of a combination of review text and numerical data such as number of tables. There is also a column for reviews which is a float (avg of all user reviews for that restaurant). So a row of data could be like: { rating: 3.765, review: `Food was great, staff was friendly`, tables: 30, staff: 15, parking: 20 ... } So …
Category: Data Science

How to tune hyperparameters in a regression neural network

I hope you are in good health. I am trying to build a simple neural network to predict shear wave well log values from other well logs, but my model is stuck at a mean absolute error of 2.45 and is not improving. I have changed the number of neurons, the learning rate, and the loss function, but to no avail. Here is my model:

    import tensorflow as tf

    tf.random.set_seed(42)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(22, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    # compiling:
    model.compile(
        loss=tf.losses.mae,
        optimizer=tf.optimizers.Adam(learning_rate=0.006),
        metrics=['mae']
    )
    # fitting:
    history = model.fit(x_train, y_train, epochs=1000, verbose=0)
    # evaluation:
    model.evaluate(x_test, y_test)

Here is the boxplot of …
Category: Data Science

Relationships between groups of features against independent variables

I have several groups of features that I'd like to test against independent variables. The idea is to find which groups tend to be associated with a specific value of an independent variable. Let's take the following example where s are samples, f are features, i are independent variables associated with each s.

        s1    s2    s3    s4    ....
    f1  0.3   0.9   0.7   0.8
    f2  ...
    f3  ...
    f4  ...
    f5  ...
    i1  low   low   med   high
    i2  0.9   1.6   2.3   …
Category: Data Science

What Equation is model.coef_ Derived From? (SKLearn)

Fairly simple question, but something I've been unable to understand firmly by scouring the interwebs. After running a LR model using SKlearn, one of the key outputs is coef_ , along with intercept_. I understand that coef_ is a transformation matrix that fully describes the relationships of the model; and that performing the dot-product of the input data, with coef_ and adding intercept_ will produce the predicted values for your inputs. My question is: What is the equation that defines …
Category: Data Science
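
As background for the question above: a linear regression fitted by ordinary least squares chooses the intercept and coefficients that minimise the residual sum of squares, and for a full-rank design matrix $X$ (with a leading column of ones) the minimiser satisfies the normal equations $X^\top X \beta = X^\top y$. The sketch below solves those equations directly on made-up data; `solve` and `ols` are hypothetical helpers, and library implementations generally rely on more numerically robust least-squares solvers than this textbook route.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, y):
    """Return (intercept, coefficients) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return beta[0], beta[1:]

# Illustrative data (an assumption): y = 1 + 2 * x1 + 3 * x2 exactly.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]
y = [3.0, 4.0, 6.0, 8.0, 12.0]

intercept, coefs = ols(rows, y)
# Prediction for a new row is intercept + dot(coefs, row).
prediction = intercept + sum(c * v for c, v in zip(coefs, (2.0, 2.0)))
```

In this picture, `intercept` plays the role of `intercept_` and `coefs` the role of `coef_`.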

Confidence interval around a standardised regression coefficient?

I have computed a simple linear regression model as below, but am confused as to whether the confint() function is sufficient to provide 95% confidence intervals around the standardised regression coefficient in the linear model (beta)? Has anyone else run into this issue or is confint() sufficient to extract the 95% confidence interval (i.e., +/-1.96 standard errors of the standardised regression coefficient)? h1a <- lm(formula = var1~ var2, data = df) # estimate value of intercept (b0) and slope (b1) …
Category: Data Science
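
One way to think about the question above is to standardise both variables before fitting, so that the fitted slope is already the standardised coefficient and its interval can be read off directly. The sketch below is written in Python rather than R and rests on several assumptions: made-up data, a hypothetical function `standardised_slope_ci`, the normal 1.96 multiplier mentioned in the question (R's `confint()` for `lm` uses a t quantile instead), and the standard deviations being treated as fixed rather than estimated.

```python
import statistics

def zscores(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def standardised_slope_ci(x, y, z=1.96):
    """Standardised slope of y on x with an approximate 95% interval."""
    n = len(x)
    zx, zy = zscores(x), zscores(y)
    sxx = sum(a * a for a in zx)
    # With both variables z-scored the intercept is zero and the slope
    # equals the Pearson correlation in a simple regression.
    beta = sum(a * b for a, b in zip(zx, zy)) / sxx
    residual_ss = sum((b - beta * a) ** 2 for a, b in zip(zx, zy))
    # Two parameters (intercept and slope) are estimated, hence n - 2.
    se = (residual_ss / (n - 2) / sxx) ** 0.5
    return beta, (beta - z * se, beta + z * se)

# Illustrative data only (an assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

beta, (lower, upper) = standardised_slope_ci(x, y)
```

Because this is a normal approximation, the upper bound can exceed 1 when the correlation is very strong, which is one reason to treat the result as a rough guide only.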

Does PCA help to include all the variables even if there is high collinearity among variables?

I have a dataset that has high collinearity among variables. When I created the linear regression model, I could not include more than five variables (I eliminated a feature whenever VIF > 5). But I need to have all the variables in the model and find their relative importance. Is there any way around it? I was thinking about doing PCA and building models on the principal components. Does that help?
Category: Data Science

SKLearn - Different Results B/w Default Linear Model and 1st Order Polynomial Linear Model

SUMMARY I'm building a linear regression model using Scikit and noticing that the model "performance" (RMSE and max error, namely) varies depending on whether I use the default LR or whether I apply PolynomialFeatures(degree=1). My understanding is that these outcomes should be identical, since they both use a first-order LR model; however, my error is consistently lower when using the PolynomialFeatures version. TLDR When I run the code below, the second chunk (polynomial = degree of 1) is consistently …
Category: Data Science

Recommendations for modelling panel data

Sending positive wishes to y'all. I have about 10 years of growth rates in real estate prices and some other macroeconomic variables such as inflation, unemployment rates, fuel prices, and growth in prices of raw materials, among many others. I want to analyze the causal effect of all of these variables on the growth in real estate prices. The simplest thing would be to build a linear regression model, but given that this is not cross-sectional and is more like time series data, …
Category: Data Science

How to Approach Linear Machine-Learning Model When Input Variables are Inconsistent

Disclaimer: I'm relatively new to the data science and ML world -- still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but am hitting a roadblock when it comes to my input data. This dataset consists of a few key input criteria: [FLOW, TEMP, PRESSURE, VOLTAGE_A] and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …
Category: Data Science

Linear Regression Coefficient Calculation

    import numpy as np

    class LR:
        def __init__(self, x, y):
            self.x = x
            self.y = y
            self.xmean = np.mean(x)
            self.ymean = np.mean(y)
            self.x_xmean = self.x - self.xmean
            self.y_ymean = self.y - self.ymean
            self.covariance = sum(self.x_xmean * self.y_ymean)
            self.variance = sum(self.x_xmean * self.x_xmean)

        def getYhat(self, input_x):
            input_x = np.array(input_x)
            return self.intercept + self.slope * input_x

        def getCoefficients(self):
            self.slope = self.covariance/self.variance
            self.intercept = self.ymean - (self.xmean * self.slope)
            return self.intercept, self.slope

I am using the above class to calculate intercept and slope for a Simple Linear …
Category: Data Science

Multiple regression (using machine learning - how to plot data)

I wonder how I can use machine learning to plot a multiple linear regression in a figure. I have one dependent variable (prices of apartments) and five independent variables (floor, builtyear, roomnumber, square meter, kr/sqm). The task is first to use machine learning to obtain the predicted values alongside the actual values, and then to plot those values in a figure. I have used this code: x_train, x_test, y_train, y_test = tts(xx1, y, test_size=3) Outcome: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False) regr.fit(x_train, y_train) …
Category: Data Science

How to find lagged cross correlation between time series?

I have 2 time series, $X$ and $Y$, and I'm trying to find the best lag range that correlates $X$ to $Y$ (find the amount(s) of lag of $X$ that best correlate to the target variable $Y$). For instance, if the best lag range is between $t = 8$ and $t = 10$, then the final equation would be $Y_t = \alpha_1 X_{t-8} + \alpha_2 X_{t-9} + \alpha_3 X_{t-10} + \alpha_4$. Since the value of $Y$ could depend not only …
Category: Data Science
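
A first step toward the lag search described in the question above is to compute the correlation between $X_{t-k}$ and $Y_t$ for each candidate lag $k$. The sketch below assumes made-up series in which $Y$ depends on $X$ three steps earlier; `pearson` and `lagged_correlations` are hypothetical helpers. It only ranks individual lags and does not fit the multi-lag equation from the question.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def lagged_correlations(x, y, max_lag):
    """Correlation between x[t - lag] and y[t] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        # Dropping the last `lag` points of x and the first `lag` points
        # of y pairs each y[t] with x[t - lag].
        result[lag] = pearson(x[:len(x) - lag], y[lag:])
    return result

# Illustrative series (an assumption): y[t] = 2 * x[t - 3] + 1 for t >= 3.
x = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 9.0, 6.0, 2.0, 8.0, 4.0, 1.0, 7.0, 5.0]
y = [0.0, 0.0, 0.0] + [2.0 * x[t - 3] + 1.0 for t in range(3, len(x))]

correlations = lagged_correlations(x, y, max_lag=6)
best_lag = max(correlations, key=correlations.get)
```

A lag range could then be picked by keeping the lags whose correlation clears a chosen threshold, though correlated neighbouring lags may call for a regression on several lags at once.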

Make a fitted xgboost or linear regression model predict values in the future

I have a machine learning model that I fitted with xgboost and linear regression. My dataset has thirteen features and has price as the target. Is there any way to make the model predict values in the future? I have date time as one of the variables. From searching on the internet, I learned about fb prophet, and that this is a time series problem. But if my xgboost is doing well, is there a way to make it predict future …
Category: Data Science
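
One common way to let a general-purpose regressor forecast ahead, relevant to the question above, is to recast the series as a supervised problem in which the features are the most recent observed values. The sketch below assumes a made-up price series and a hypothetical helper `make_lag_rows`; it only builds the training rows and the feature row for the next unseen step, and leaves the choice of model open.

```python
def make_lag_rows(series, n_lags):
    """Turn a series into (features, target) pairs using the previous n_lags values."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows

# Illustrative prices ordered by time (an assumption, not real data).
prices = [10.0, 11.0, 13.0, 12.0, 15.0, 16.0]

# Training pairs: each target is predicted from the two values before it.
rows = make_lag_rows(prices, n_lags=2)

# Features for the first not-yet-observed step are the last two observations.
next_features = prices[-2:]
```

Any regressor trained on `rows` can then be asked for a prediction at `next_features`; forecasting further ahead means feeding predictions back in as lag values, which compounds errors.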

How do I correctly build model on given data to predict target parameter?

I have a dataset which contains different parameters, and data.head() looks like this. I applied some preprocessing and performed feature ranking:

    import pandas as pd

    dataset = pd.read_csv("ML.csv", header=0)
    # Get dataset brief
    print(dataset.shape)
    print(dataset.isnull().sum())
    #print(dataset.head())
    # Data pre-processing
    data = dataset.drop('organization_id', axis=1)
    data = data.drop('status', axis=1)
    data = data.drop('city', axis=1)
    # Find median for features having NaN
    median_zip, median_role_id, median_specialty_id, median_latitude, median_longitude = \
        data['zip'].median(),\
        data['role_id'].median(),\
        data['specialty_id'].median(),\
        data['latitude'].median(),\
        data['longitude'].median()
    data['zip'].fillna(median_zip, inplace=True)
    data['role_id'].fillna(median_role_id, inplace=True)
    data['specialty_id'].fillna(median_specialty_id, inplace=True)
    data['latitude'].fillna(median_latitude, inplace=True)
    data['longitude'].fillna(median_longitude, inplace=True)
    # Fill YearOFExp with 0
    data['years_of_experience'].fillna(0, inplace=True)
    target = dataset.location_id …
Category: Data Science

Feature importance of a linear regression

What is the simplest, easiest-to-explain feature importance calculation for linear regression? I know I can use Shap to compute feature importance, but I find it difficult to explain to stakeholders, and the raw coefficient is not a good measure of feature importance since it depends on the scale of the feature. Some have suggested (standard deviation of feature) * (feature coefficient) as a good measure of feature importance.
Category: Data Science
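
The measure suggested in the question above, standard deviation times coefficient, is easy to compute by hand. The sketch below uses made-up feature values and coefficients and a hypothetical helper `scaled_importance`. Each score is the absolute change in the prediction when that feature moves by one sample standard deviation, which is one way to phrase it for stakeholders.

```python
import statistics

def scaled_importance(columns, coefficients):
    """|coefficient| * sample standard deviation, per feature."""
    return {
        name: abs(coefficients[name]) * statistics.stdev(values)
        for name, values in columns.items()
    }

# Illustrative inputs (assumptions): raw feature columns and fitted coefficients.
columns = {
    "area_m2": [30.0, 50.0, 70.0, 90.0],
    "rooms": [1.0, 2.0, 2.0, 3.0],
}
coefficients = {"area_m2": 10.0, "rooms": 150.0}

importance = scaled_importance(columns, coefficients)
ranking = sorted(importance, key=importance.get, reverse=True)
```

In this toy case `rooms` has the larger raw coefficient, yet `area_m2` ranks first once the spread of each feature is taken into account.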

Please help: I got an error while trying to plot my data, saying that x and y must be the same size

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    data = pd.read_csv('housing.csv')
    data.drop('ocean_proximity', axis=1, inplace=True)
    data.head()

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
    0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0
    1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0
    2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0
    3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0
    4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0

…
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.