Why do scikit-learn and statsmodels give different coefficients of determination?

First of all, I know there is a similar question; however, I didn't find it very helpful. My issue concerns simple linear regression and the resulting R-squared. I found that the results can be quite different depending on whether I use statsmodels or scikit-learn. First of all, my snippet: import altair as alt import numpy as np import pandas as pd from sklearn.linear_model import LinearRegression import statsmodels.api as sm np.random.seed(0) data = pd.DataFrame({ 'Date': pd.date_range('1990-01-01', freq='D', periods=50), 'NDVI': np.random.uniform(low=-1, high=1, …
Category: Data Science
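
The excerpt above is cut off, so the actual cause in that question is not known. One common source of such a mismatch, assumed here purely for illustration, is that `sm.OLS` does not add an intercept unless you call `sm.add_constant`, and statsmodels then reports an uncentered R-squared, while scikit-learn's `LinearRegression` fits an intercept by default and scores with the centered definition. The sketch below uses made-up numbers and plain Python to show how far the two definitions can diverge for the same no-intercept fit.

```python
def r_squared(y, y_hat, centered=True):
    """R^2 = 1 - SS_res / SS_tot, with a centered or uncentered total sum of squares."""
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    if centered:
        mean_y = sum(y) / len(y)
        ss_tot = sum((a - mean_y) ** 2 for a in y)
    else:
        ss_tot = sum(a ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Hypothetical data, not taken from the question.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 11.0, 12.0, 13.0]

# Least-squares slope for a model WITHOUT an intercept: sum(x*y) / sum(x*x).
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
y_hat = [slope * a for a in x]

r2_centered = r_squared(y, y_hat, centered=True)
r2_uncentered = r_squared(y, y_hat, centered=False)
print(r2_centered, r2_uncentered)
```

If this is the cause, fitting both libraries with the same intercept setting should bring the two values into agreement.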

Time series VAR vs VARMA model: model takes too long to fit

I want to use a VARMA model on a dataset of about 80000 samples with 10 features. I tried the VARMA model from statsmodels with p=50 and q=10, but it takes too long to build the model: it was still running after 12 hours. Then I tested VARMA using p=50 and q=0; this was also still running after an hour with maxiter=1. The code I am using is: from statsmodels.tsa.statespace.varmax import VARMAX modelVARMA = VARMAX(dff, order=(50,0)) …
Category: Data Science

How to interpret my logistic regression result with statsmodels

I am doing a logistic regression with statsmodels and sklearn, and my result confuses me a bit. I used a feature selection algorithm in my previous step, which tells me to use only feature1 for my regression. The results are the following: So the model predicts 1 for everything, and my p-value is < 0.05, which suggests to me that it is a pretty good indicator. But the accuracy score is < 0.6, which means it basically doesn't say anything. Can you …
Category: Data Science

Why are coefficients from logistic regression not proportional to the bad rate?

I am building a logistic regression model in Python with statsmodels.api.Logit. The model contains 12 features that are encoded using pandas.get_dummies(). My final training dataset (xTrain) looks like this: feature1_A feature1_B feature2_B feature2_C feature_2_D 0 1 0 1 0 0 1 1 0 0 1 0 0 1 0 feature1 is a categorical feature that contains 3 modalities (or categories) A, B, and C (C is used as a base reference so it does not appear in my training set) …
Category: Data Science

NaN, inf or invalid value detected in endog, estimation infeasible error when training statsmodels GLM model

I am trying to build a GLM model (Poisson family) on training data using the Python statsmodels package. The data contains categorical values as exogenous variables and numerical values for my target (endogenous variable). I standardized the numeric values and one-hot encoded the categorical values (dropping the first level). When I fit the model to the data, I got the following exception: ValueError: NaN, inf or invalid value detected in endog, estimation infeasible. When creating this model the …
Category: Data Science

Persistence and stationarity together

I am trying to analyse a time series. I want to get only quantitative results (so, I'm excluding things like "looking at this plot we can note..." or "as you can see in the chart ..."). In my job, I analyse stationarity and persistence. First, I run the ADF test and get "stationary" or "non-stationary" as the result. Then, I need to work on persistence. To do so, I use the ACF. My question is: suppose I get a "non-stationary" time series. Is it …
Category: Data Science
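
The persistence step in the question above relies on the ACF. As a rough illustration of what that quantity measures, here is a minimal plain-Python sketch of the sample autocorrelation at a given lag. The function name `sample_acf` and the toy series are assumptions made for this example, not part of the question; in practice one would normally use a library routine.

```python
def sample_acf(series, lag):
    """Sample autocorrelation at `lag`, normalized by the lag-0 sum of squares."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    denominator = sum(d * d for d in deviations)
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return numerator / denominator


# Hypothetical trending series: neighbouring points are positively correlated.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf_values = [sample_acf(toy_series, k) for k in range(3)]
print(acf_values)
```

Values that stay close to 1 over many lags indicate strong persistence; values that drop quickly toward 0 indicate little memory in the series.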

Difference in statsmodels output vs direct linear algebra with a binary input variable

I was wondering why there might be a difference when I run a simple multiple linear regression with statsmodels OLS, vs just doing it directly with numpy. The results are identical for both cases, so long as I don't include sex (binary) as one of the predictor variables. I am wondering why this might be the case, and which to prefer in this case? I noticed that in the output of statsmodels it also says Sex[T.1] which may be related …
Category: Data Science

statsmodels: manually set/restore the coefficients of a model

I was wondering if it is possible to manually restore the coefficients of a given model. That is, given a computed set of coefficients, can I reinitialize another statsmodels model with those parameter (coefficient) outputs? I have tried doing so (in the context of OLS multiple linear regression), but have gotten errors, and I suspect it is because I try to restore the coefficients by fitting to a single sample dataframe (which is a test set), and that maybe alters some properties …
Category: Data Science

PACF for Airline Passengers dataset: What's wrong?

The airline passengers dataset is available here, but it also comes with R. I'm working with Python, and I import the following (besides the usual ones like pandas and numpy): from statsmodels.tsa.stattools import pacf,acf from statsmodels.graphics import tsaplots from statsmodels.tsa.stattools import adfuller,kpss from statsmodels.tsa.statespace.sarimax import SARIMAX from scipy import stats I applied the log, then a 1-period difference for detrending, and then a 12-period difference for 'deseasonality'. Then I drop the NaNs with df_log_dif_dif12.dropna(inplace=True). I obtain the following numpy array: array([[ …
Category: Data Science

Does this ARIMA model take seasonality into account?

I'm writing a tutorial on traditional time series forecasting models. One key issue with ARIMA models is that they cannot model seasonal data. So, I wanted to get some seasonal data and show that the model cannot handle it. However, it seems to model the seasonality quite easily - it peaks every 4 quarters as per the original data. What is going on? Code to reproduce the plot from statsmodels.datasets import get_rdataset from statsmodels.tsa.arima.model import ARIMA import matplotlib.pyplot as plt …
Category: Data Science

How do I use the number of hours as the index in time series forecasting?

I have a dataset that contains the hour number (consecutive values) and the total sales in that hour. See below for the head of the dataset:

t   sales
--------------
23  172.3676
24  176.3456
25  166.9039
26  153.9990
27  167.9585

I want to forecast the sales for the next 10 hours. I also set column t as the index. However, when I try to get the seasonal decomposition, it shows an error: result = seasonal_decompose(train['sales'].dropna(), model='additive', freq=12) result.plot() …
Category: Data Science

Multiple binary classifiers to find the best propensity to buy one of the products

Problem: I have 5 products for sale and I can pitch only one product per month to each customer, so I want to know which product the customer is likely to buy. Proposed solution: I build 5 binary logistic models to estimate the probability that each customer buys a particular product, which gives me 5 probabilities. Whichever model gives the maximum probability among the 5, I pitch that product to the customer. For example, if we have products A, B, C, D, E to …
Category: Data Science
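
The selection rule described in the question above (score each product with its own binary model, then pitch the one with the highest probability) can be sketched as follows. The probabilities are made-up placeholders standing in for the outputs of the five models, and `best_product` is a hypothetical helper name.

```python
def best_product(probabilities):
    """Return the product whose model reports the highest purchase probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical per-product probabilities for one customer (one value per binary model).
customer_scores = {"A": 0.12, "B": 0.31, "C": 0.27, "D": 0.05, "E": 0.18}

choice = best_product(customer_scores)
print(choice)
```

This only implements the argmax step; whether probabilities from five separately trained models are directly comparable is a separate question that the sketch does not address.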

SARIMAX fit for predicting further into the future

I want to fit the SARIMAX model from statsmodels so that it is optimized for predicting further into the future, not just the next sample, say 5 time steps ahead. I can forecast with model.forecast(5), but what I am trying to do is actually fit the model so that it learns how to best predict 5 time steps ahead. Is this possible?
Category: Data Science

Selecting the best model parameters from grid search SARIMA [Time series]

I ran a manual grid search of SARIMA across several parameters and now I have 7875 rows of scores (RMSE, MAE, MAPE each) from it. These were the parameters (30k+ permutations) I ran the grid search over: p = [0 to 10] d = [0,1,2] q = [0 to 12] P = [0 to 5] D = [0,1] Q = [0,1,2] S = [0,7] These are the top 20 rows of the results, sorted by RMSE in ascending order. Parameters are in …
Category: Data Science
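
To make the size of the grid in the question above concrete, the following sketch enumerates the stated ranges with `itertools.product`. It assumes that "0 to N" means an inclusive range, which is consistent with the "30k+ permutations" figure quoted in the question; no model is fitted here.

```python
from itertools import product

# Ranges copied from the question; "0 to N" is assumed to be inclusive.
grid = {
    "p": range(0, 11),
    "d": [0, 1, 2],
    "q": range(0, 13),
    "P": range(0, 6),
    "D": [0, 1],
    "Q": [0, 1, 2],
    "S": [0, 7],
}

# One dict per parameter combination, e.g. {"p": 0, "d": 0, ..., "S": 0}.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_combos = len(combos)
print(n_combos)  # 30888
```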

How to do backward feature elimination when considering interactions between features

I have a multiple linear regression problem: $Y$ is my target and $X_1, X_2, X_3$ are my features. In my regression, I consider the interactions between $X_1, X_2, X_3$ and I add a bias. So my problem is given by: $Y \sim X_1 + X_2 + X_3 + X_1X_2 + X_1X_3 + X_2X_3 + bias$ Now, I fit my model with statsmodels.api and I want to recursively eliminate the feature with the highest p-value. My first question is: for example, …
Category: Data Science

Selecting most important features for multilinear regression

I have a set of 25 features. I would like to choose the best features for my model. Originally, I was looking at the correlation of features with respect to response, and only taking those which are highly correlated and run a regression model. Then, using that model I would predict the outcome based on test data, and compare it to actual (metric RMSE) and this would be how I assess it. I could then add each feature in order …
Category: Data Science
