I have a dataset that has the following columns: the variable I'm trying to predict is "rent". My dataset looks a lot like the one used in this notebook. I tried to normalize the rent column and the area column using a log transformation, since both columns had positive skewness. Here are the rent and area column distributions before and after the log transformation. I thought that after these changes my regression models would improve, and in fact they …
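For reference, a minimal sketch of the kind of log transformation described here, assuming a pandas DataFrame named df with rent and area columns (the data below is made up):

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the skewed rent/area columns in the question.
df = pd.DataFrame({"rent": [500, 750, 1200, 9000], "area": [30, 45, 80, 400]})

# log1p (log(1 + x)) behaves like a plain log for large values but also handles zeros.
df["rent_log"] = np.log1p(df["rent"])
df["area_log"] = np.log1p(df["area"])

print(df[["rent", "area"]].skew())          # skewness of the raw columns
print(df[["rent_log", "area_log"]].skew())  # should be noticeably smaller
```

One thing to keep in mind: if a model is trained on the log of rent, its predictions are on the log scale and need np.expm1 (or np.exp for a plain log) before being compared with the original rent values.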
What happens if a certain dataset contains different "groups" that follow different linear models? For example, let's imagine that, examining the scatterplot of a certain feature $x_i$ against $y$, we can see that some points follow a linear relationship with a coefficient $\beta_A<0$ while other points clearly have $\beta_B>0$. We can infer that these points belong to two different populations: population $A$ responds negatively to high values of feature $x_i$, while population $B$ responds positively. We then create a categorical …
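A minimal sketch of what a group indicator with an interaction term might look like (the column names and data below are made up for illustration; the statsmodels formula x * group expands to x + group + x:group, so each group gets its own slope):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Hypothetical two-population data: group A has a negative slope, group B a positive one.
group = rng.choice(["A", "B"], size=n)
x = rng.uniform(0, 10, size=n)
slope = np.where(group == "A", -2.0, 3.0)
y = slope * x + rng.normal(scale=1.0, size=n)

df = pd.DataFrame({"x": x, "y": y, "group": group})

# The interaction term x:group lets each group have its own slope (and intercept shift).
model = smf.ols("y ~ x * group", data=df).fit()
print(model.params)
```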
In transfer learning, often only the last layer of the network is retrained using gradient descent. However, the last layer of a common neural network performs only a linear transformation, so why do we use gradient descent and not linear (or logistic) regression to finetune the last layer?
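As a point of comparison, here is a rough sketch of the alternative the question describes: freeze a pretrained backbone, extract its features, and fit the final classifier with an off-the-shelf logistic regression solver instead of gradient descent. The backbone choice, input shape, and data below are assumptions made purely for illustration.

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

# Frozen feature extractor (weights=None only to keep the sketch self-contained;
# in practice this would be a pretrained network, e.g. weights="imagenet").
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(96, 96, 3), weights=None
)
backbone.trainable = False

# Fake images and binary labels standing in for a real dataset.
x = np.random.rand(32, 96, 96, 3).astype("float32")
y = np.random.randint(0, 2, size=32)

features = backbone.predict(x, verbose=0)                  # fixed features from the frozen layers
clf = LogisticRegression(max_iter=1000).fit(features, y)   # the "last layer", fitted without SGD
print(clf.score(features, y))
```

The hypothesis class is essentially the same either way (a linear map plus sigmoid/softmax on frozen features); what differs is mainly how the weights are found and which regularisation the solver applies.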
I'm very new to data science (this is my hello world project), and I have a data set made up of a combination of review text and numerical data such as the number of tables. There is also a rating column, which is a float (the average of all user reviews for that restaurant). So a row of data could look like: { rating: 3.765, review: `Food was great, staff was friendly`, tables: 30, staff: 15, parking: 20 ... } So …
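One common scikit-learn pattern for mixing a free-text column with numeric columns is a ColumnTransformer that vectorises the text and passes the numbers through. The sketch below mirrors the column names from the example row, with made-up data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "review": ["Food was great, staff was friendly", "Slow service and cold food"],
    "tables": [30, 12],
    "staff": [15, 6],
    "parking": [20, 0],
    "rating": [3.765, 2.1],
})

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),                      # text -> TF-IDF features
    ("numeric", "passthrough", ["tables", "staff", "parking"]), # numeric columns used as-is
])

model = Pipeline([("prep", preprocess), ("reg", Ridge())])
model.fit(df, df["rating"])
print(model.predict(df))
```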
Hope you are enjoying good health. I am trying to build a simple neural network that has to predict shear wave well log values from other well logs, but my model is stuck at a mean absolute error of 2.45 and it is not improving further. I have changed the number of neurons, the learning rate, and the loss function, but to no avail. Here is my model:

tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(22, activation='relu'),
    tf.keras.layers.Dense(1)
])

# compiling
model.compile(
    loss=tf.losses.mae,
    optimizer=tf.optimizers.Adam(learning_rate=0.006),
    metrics=['mae']
)

# fitting
history = model.fit(x_train, y_train, epochs=1000, verbose=0)

# evaluation
model.evaluate(x_test, y_test)

Here is the boxplot of …
I have several groups of features that I'd like to test against independent variables. The idea is to find which groups tend to be associated with a specific value of an independent variable. Let's take the following example, where the s are samples, the f are features, and the i are independent variables associated with each s.

      s1    s2    s3    s4   ....
f1    0.3   0.9   0.7   0.8
f2    ...
f3    ...
f4    ...
f5    ...
i1    low   low   med   high
i2    0.9   1.6   2.3   …
Fairly simple question, but something I've been unable to understand firmly by scouring the interwebs. After running an LR model using sklearn, one of the key outputs is coef_, along with intercept_. I understand that coef_ is a transformation matrix that fully describes the relationships of the model, and that taking the dot product of the input data with coef_ and adding intercept_ will produce the predicted values for your inputs. My question is: What is the equation that defines …
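A quick way to see the relationship described here is to check the manual computation against predict() on some made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# y_hat = X · coef_ + intercept_
manual = X @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X)))  # True
```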
I have computed a simple linear regression model as below, but I am confused about whether the confint() function is sufficient to provide 95% confidence intervals around the standardised regression coefficient (beta) in the linear model. Has anyone else run into this issue, or is confint() sufficient to extract the 95% confidence interval (i.e., +/- 1.96 standard errors of the standardised regression coefficient)?

h1a <- lm(formula = var1 ~ var2, data = df)  # estimate value of intercept (b0) and slope (b1) …
I have a dataset with high collinearity among variables. When I created the linear regression model, I could not include more than five variables (I eliminated a feature whenever its VIF was > 5). But I need to have all the variables in the model and find their relative importance. Is there any way around this? I was thinking about doing PCA and building models on the principal components. Does that help?
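For context, a minimal sketch of the PCA idea mentioned here (often called principal component regression); the data, column count, and number of components below are made up, and the features are standardised before PCA:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + rng.normal(scale=0.05, size=(200, 5))  # deliberately collinear columns
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# Standardise -> project onto a few principal components -> ordinary least squares.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))
```

The components are linear combinations of the original variables, so any statement about the importance of the original features has to be read back through the PCA loadings; ridge regression is another common way to keep all collinear predictors in a single model.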
SUMMARY I'm building a linear regression model using Scikit and noticing that the model "performance" (namely RMSE and max error) varies depending on whether I use the default LR or whether I apply PolynomialFeatures(degree=1). My understanding is that these outcomes should be identical, since they are both fitting a first-order linear model; however, my error is consistently lower when using the PolynomialFeatures version. TLDR When I run the code below, the second chunk (polynomial = degree of 1) is consistently …
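For reference, a sketch of the comparison on synthetic data. One detail worth checking in the original code is that PolynomialFeatures(degree=1) with the default include_bias=True adds a constant column, so the design matrices are not literally identical unless include_bias=False is set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.2, size=200)

plain = LinearRegression().fit(X, y)
poly1 = make_pipeline(
    PolynomialFeatures(degree=1, include_bias=False),  # degree 1 leaves the features unchanged
    LinearRegression(),
).fit(X, y)

# With the same features and the same estimator, the errors should match (up to float noise).
print(mean_squared_error(y, plain.predict(X)))
print(mean_squared_error(y, poly1.predict(X)))
```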
Sending positive wishes to y'all. I have about 10 years of growth rates in real estate prices and some other macroeconomic variables such as inflation, unemployment rates, fuel prices, and growth in prices of raw materials, among many others. I want to analyze the causality of all of these variables on the growth in real estate prices. The simplest thing would be to build a linear regression model, but given that this is not cross-sectional data and more like time series data, …
Disclaimer: I'm relatively new to the data science and ML world -- still trying to get a firm grasp on the fundamentals. I'm trying to overcome a regression challenge involving a large, multi-dimensional dataset, but am hitting a roadblock when it comes to my input data. This dataset consists of a few key input criteria: [FLOW, TEMP, PRESSURE, VOLTAGE_A] and a single output variable, VOLTAGE_B (this is what I'm hoping to effectively model and predict). I'm able to handle this …
I wonder how I can use machine learning to plot a multiple linear regression in a figure. I have one dependent variable (the price of the apartment) and five independent variables (floor, builtyear, roomnumber, square meter, kr/sqm). The task is to first use machine learning to obtain the predicted values and the actual values, and then plot those values in a figure. I have used this code:

x_train, x_test, y_train, y_test = tts(xx1, y, test_size=3)

Outcome:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

regr.fit(x_train, y_train)
…
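A common way to visualise a fit with several predictors is a predicted-vs-actual scatter plot. The sketch below uses synthetic data, with y_test and regr.predict(x_test) playing the roles of the question's actual and predicted prices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-ins for floor, builtyear, roomnumber, square meter, kr/sqm
y = X @ np.array([3.0, 1.0, 2.0, 5.0, 0.5]) + rng.normal(scale=0.5, size=100)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
regr = LinearRegression().fit(x_train, y_train)
y_pred = regr.predict(x_test)

plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], linestyle="--")  # perfect-prediction line
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.show()
```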
I have 2 time series, $X$ and $Y$, and I'm trying to find the best lag range that correlates $X$ to $Y$ (find the amount(s) of lag of $X$ that best correlate to the target variable $Y$). For instance, if the best lag range is between $t = 8$ and $t = 10$, then the final equation would be $Y_t = \alpha_1 X_{t-8} + \alpha_2 X_{t-9} + \alpha_3 X_{t-10} + \alpha_4$. Since the value of $Y$ could depend not only …
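One simple, brute-force way to look for the best lag is to correlate $Y_t$ with $X_{t-k}$ for a range of candidate $k$ and see where the correlation peaks. The sketch below uses synthetic series in which $X$ leads $Y$ by 9 steps, so lags around 8-10 should stand out; the names and the lag range are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x = pd.Series(rng.normal(size=n))
y = 2.0 * x.shift(9) + rng.normal(scale=0.5, size=n)   # Y depends on X lagged by 9 steps

# Pearson correlation between Y_t and X_{t-k} for each candidate lag k.
corrs = {k: y.corr(x.shift(k)) for k in range(0, 21)}
best = max(corrs, key=lambda k: abs(corrs[k]))
print(best, corrs[best])
```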
I have a machine learning model that I fitted with xgboost and linear regression. My dataset has thirteen features and has price as the target. Is there any way to make the model predict values in the future? I have datetime as one of the variables. From searching on the internet, I learned about fb prophet and that this is a time series problem. But if my xgboost is doing well, is there a way to make it predict future …
I have a dataset which contains different parameters, and data.head() looks like this. I applied some preprocessing and performed feature ranking:

import pandas as pd

dataset = pd.read_csv("ML.csv", header=0)

# Get a brief look at the dataset
print(dataset.shape)
print(dataset.isnull().sum())
# print(dataset.head())

# Data pre-processing
data = dataset.drop('organization_id', axis=1)
data = data.drop('status', axis=1)
data = data.drop('city', axis=1)

# Find the median for features having NaN
median_zip, median_role_id, median_specialty_id, median_latitude, median_longitude = \
    data['zip'].median(), \
    data['role_id'].median(), \
    data['specialty_id'].median(), \
    data['latitude'].median(), \
    data['longitude'].median()

data['zip'].fillna(median_zip, inplace=True)
data['role_id'].fillna(median_role_id, inplace=True)
data['specialty_id'].fillna(median_specialty_id, inplace=True)
data['latitude'].fillna(median_latitude, inplace=True)
data['longitude'].fillna(median_longitude, inplace=True)

# Fill years_of_experience with 0
data['years_of_experience'].fillna(0, inplace=True)

target = dataset.location_id …
What is the easiest, and easiest to explain, feature importance calculation for linear regression? I know I can use SHAP to compute feature importance, but I find it difficult to explain to stakeholders, and the raw coefficient is not a good measure of feature importance since it depends on the scale of the feature. Some have suggested (standard deviation of the feature) * (feature coefficient) as a good measure of feature importance.
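A minimal sketch of that suggestion, on made-up data: multiplying each coefficient by the standard deviation of its feature puts differently scaled features on a comparable footing (it is equivalent to looking at the coefficients after standardising the features).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "small_scale": rng.normal(scale=1.0, size=300),
    "large_scale": rng.normal(scale=100.0, size=300),
})
y = 5.0 * X["small_scale"] + 0.05 * X["large_scale"] + rng.normal(size=300)

model = LinearRegression().fit(X, y)

# importance_j = |coef_j| * std(feature_j)
importance = np.abs(model.coef_) * X.std().to_numpy()
print(pd.Series(importance, index=X.columns).sort_values(ascending=False))
```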