I'm a psychology student trying to come up with a research plan involving GLM. I'm thinking about adding an interaction term in the analysis, but I'm unsure about its interpretation. To keep things simple, I'm going to use linear regression as an example. I'm expecting a (simplified) model like this: $$y = a x_{1} + b x_{2} + c (x_{1} \cdot x_{2}) + e$$ In my hypothesis, $x_{1}$ and $y$ are negatively correlated, and $x_{2}$ and $y$ are positively correlated. As for correlation between $x_{1}$ …
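A minimal sketch of how such a model could be simulated and fit in R (the effect sizes and signs below are made-up assumptions, just to see how the interaction term reads in the output):

    set.seed(1)
    n  <- 200
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    # assumed signs: x1 lowers y, x2 raises y, plus an interaction c
    y  <- -0.5 * x1 + 0.8 * x2 + 0.3 * (x1 * x2) + rnorm(n)
    fit <- lm(y ~ x1 * x2)   # x1 * x2 expands to x1 + x2 + x1:x2
    summary(fit)
    # the x1:x2 row estimates c: how the slope of x1 changes per unit of x2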
I really need help with GAMs. I have to find out whether an association is linear or non-linear by using a GAM. The predictor variable is temperature at lag 0 and the outcome is cardiovascular admissions (a count variable). I have tried a lot, but I am not able to understand how to interpret the graph and output that I am getting. I tried this formula using the mgcv package: model1 <- gam(cvd ~ s(templg0), family = poisson) summary(model1) plot(model1) So here is the output for summary that …
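A common way to make the linear-vs-non-linear question concrete with mgcv is to fit both forms and compare them; a sketch, assuming the variables live in a data frame called dat (a name made up here):

    library(mgcv)
    m_lin    <- gam(cvd ~ templg0,    family = poisson, data = dat)  # linear effect
    m_smooth <- gam(cvd ~ s(templg0), family = poisson, data = dat)  # penalised smooth
    AIC(m_lin, m_smooth)   # lower AIC favours that form
    summary(m_smooth)      # an edf close to 1 means the smooth is effectively linear
    plot(m_smooth, shade = TRUE)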
I am trying to classify cars for a towing company. Junky cars earn more when sent to the junkyard, and the more valuable cars should earn more at auction, despite the auction fee. Creating a logistic regression that takes into account Make, Model, Mileage, Year and Run status helps us improve the accuracy of which cars should go where, but a difficulty arises: sometimes a car that would be classified as junk can actually be an outlier and sell …
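Since the real objective here is revenue rather than the label itself, one option is to turn the predicted probability into an expected-value rule. A hedged sketch: the columns auction_value, auction_fee, junk_value and the label sold_at_auction are invented for illustration, not fields from the original data:

    # logistic model for the chance a car does well at auction
    fit <- glm(sold_at_auction ~ Make + Mileage + Year + RunStatus,
               family = binomial, data = cars)
    p <- predict(fit, newdata = cars, type = "response")
    # route each car to wherever its expected revenue is higher
    expected_auction <- p * (cars$auction_value - cars$auction_fee)
    cars$decision <- ifelse(expected_auction > cars$junk_value, "auction", "junkyard")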
I am trying to build a GLM (Poisson family) using the Python statsmodels package on training data. The data I have contains categorical values as exogenous variables and numerical values for my target (endogenous variable). I standardized the numeric values and one-hot-encoded the categorical values (dropping the first level). When I fit the data to the model, I got the following exception: ValueError: NaN, inf or invalid value detected in endog, estimation infeasible. When creating this model the …
Is it possible to plot the deviance residuals and leverage (e.g. Cook's distance) of every observation fitted in a GLM model using H2O? From H2O's documentation, it seems it only calculates the sum of all deviance residuals and cannot output the residuals for each observation.
I want to try H2O's Model Selection function in Python, but cannot load the library for some reason. The following code failed: from h2o.estimators import H2OModelSelectionEstimator Error message: cannot import name 'H2OModelSelectionEstimator' from 'h2o.estimators' Other H2O libraries like H2OGeneralizedLinearEstimator worked fine for me, though. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/model_selection.html
I am working on a bag of words for the Toxic Comments Classification challenge. The challenge is closed, but the dataset is very nice to learn from. I use R, tf-idf, tm, and logistic regression. I see a strange pattern in the accuracy results, linked with the error: "glm.fit: algorithm did not converge". I tried the solution proposed in other answers and multiplied maxit by 4, but it did not help. Glimpse of the functions used: sub-sampling. The original distribution is 200K non-toxic …
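With tf-idf features, glm.fit non-convergence is often (quasi-)complete separation rather than too few iterations, in which case raising maxit cannot help. A sketch of two things worth trying (the data names are placeholders):

    # 1) more iterations, in case it really is just slow convergence
    fit <- glm(toxic ~ ., data = train_df, family = binomial,
               control = glm.control(maxit = 100))
    # 2) penalised logistic regression: the ridge penalty keeps
    #    coefficients finite even under separation
    library(glmnet)
    fit_ridge <- cv.glmnet(as.matrix(train_x), train_y,
                           family = "binomial", alpha = 0, type.measure = "auc")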
I will try to keep this short. As an assignment for my GLM course, we were given a dataset on the # of homicide victims a person knows, as well as the race of the person. The main idea is to answer the scientific question "Does race help explain how many homicide victims a person knows?". This same dataset, and actually nearly all of the sub-problems, are solved here: https://data.library.virginia.edu/getting-started-with-negative-binomial-regression-modeling/. My issue is that I am struggling to understand the difference between …
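For context, the standard comparison on this kind of count data is a Poisson fit against a negative binomial fit; a minimal sketch (homicide, nvics, and race are assumed names for the data frame and its columns):

    library(MASS)
    m_pois <- glm(nvics ~ race, family = poisson, data = homicide)
    m_nb   <- glm.nb(nvics ~ race, data = homicide)
    AIC(m_pois, m_nb)  # the NB model adds a dispersion parameter theta
    m_nb$theta         # small theta indicates strong overdispersion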
Hypothetically, if your company's sales had dropped significantly in 2020, what approach would you take to describe the cause? Can you build a model to predict the decrease (between 2019 and 2020, for example) to visualize what the leading indicators are?
I'm trying to create a logistic regression model with ridge regularization; this is the code: glmnet(X_Train, Y_Train, family = 'binomial', alpha = 0, type.measure = 'auc') And this is the error message I'm getting: Error in storage.mode(xd) <- "double" : 'list' object cannot be coerced to type 'double' I tried converting all the variables to "numeric", but it still doesn't work. I'm going to post the code for those two datasets so you can reproduce it. Libraries: library(dplyr) library(fastDummies) library(missForest) library(glmnet) Data: url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data' crx <- read.csv(url, …
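For reference, glmnet wants a numeric matrix, and a data frame is internally a list, which is what the coercion error is complaining about; type.measure is also an argument of cv.glmnet rather than glmnet. A sketch, assuming X_Train is still a data frame of dummies and numerics:

    library(glmnet)
    X <- model.matrix(~ . - 1, data = X_Train)  # data frame -> numeric matrix
    fit <- cv.glmnet(X, Y_Train, family = "binomial",
                     alpha = 0, type.measure = "auc")
    fit$lambda.min  # lambda with the best cross-validated AUC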
I'm trying to fit a GLM to predict a continuous variable between 0 and 1 with statsmodels. Because I have more features than data points, I need to regularize. statsmodels has very few examples, so I'm not sure if I'm doing this correctly. import statsmodels.api as sm logistic_regression_model = sm.GLM( y, # shape (num data,) X, # shape (num data, num features) link=sm.genmod.families.links.logit) results = logistic_regression_model.fit_regularized(alpha=1.) results.summary() When I run this, asking for a summary raises an error. NotImplementedError Traceback (most recent …
I'm a beginner in machine learning, and I've studied that collinearity among the predictor variables of a model is a huge problem, since it can lead to unpredictable model behaviour and large errors. But are there some models (say, GLMs) that are perhaps 'okay' with collinearity, unlike classic linear regression? It is said that classic linear regression assumes there is no correlation between its independent variables. This question arises because I was doing a project that said …
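A small sketch of the usual contrast: with two nearly identical predictors, OLS coefficients become unstable, while a ridge penalty keeps them tame (all numbers are made up for illustration):

    set.seed(42)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.01)  # almost perfectly collinear with x1
    y  <- x1 + rnorm(n)
    coef(lm(y ~ x1 + x2))           # large, offsetting estimates
    library(glmnet)
    coef(glmnet(cbind(x1, x2), y, alpha = 0, lambda = 0.1))  # shrunk and stable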
I am trying to run LOOCV on my regression model. I tried to run it in R and encountered the following warning message: Warning message in y - yhat: "longer object length is not a multiple of shorter object length" This is my model: x = glm(x, data = full_data) mse_loocv = cv.glm(full_data, x) mse_loocv$delta Variables used in glm are: x -> target_deathrate ~ avganncount + avgdeathsperyear + incidencerate + medincome + popest2015 + povertypercent + studypercap + medianage + medianagemale + medianagefemale + percentmarried + pctnohs18_24 …
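For what it's worth, cv.glm expects a glm object fitted straight from a formula, and reusing x as both the formula and the model object is easy to trip over; a sketch of the usual pattern (only a few of the predictors above, for brevity):

    library(boot)
    fit <- glm(target_deathrate ~ avganncount + avgdeathsperyear + incidencerate,
               data = full_data)
    mse_loocv <- cv.glm(full_data, fit)  # K defaults to n, i.e. leave-one-out
    mse_loocv$delta                      # raw and bias-corrected CV error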
I am trying to model a response variable which is a proportion (so a response between 0 and 1; see the picture for its distribution). Ideally I would like to model it without using the actual counts, i.e. as a decimal. So far I have been using a binomial family in R. model <- glm(Response ~ X1 + X2 + X3, data = Training_data, family = 'binomial') I think the model is doing okay, but when I use it for predictions it …
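One standard way to model a proportion directly as a decimal is the quasibinomial family (same mean structure as binomial, but no integer-count requirement), with predictions made on the response scale; a sketch using the question's names (New_data is a placeholder):

    model <- glm(Response ~ X1 + X2 + X3,
                 data = Training_data, family = quasibinomial)
    # type = "response" inverts the logit link, so predictions land in (0, 1)
    preds <- predict(model, newdata = New_data, type = "response")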
When I fit a linear model with many predictor variables, I can avoid writing all of them by using . as follows: model = lm(target_deathrate ~ ., data = full_data) But for models with higher complexity, I cannot make this work: x = glm(target_deathrate ~ poly(., i), data = full_data) In these cases I have to write out all the variables. How can I avoid writing all the variable names and still include all variables in my model?
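The . shorthand does not expand inside poly(), but the formula can be assembled as a string and converted; a sketch (it assumes all predictors are numeric, since poly() fails on factors):

    predictors <- setdiff(names(full_data), "target_deathrate")
    rhs  <- paste(sprintf("poly(%s, %d)", predictors, i), collapse = " + ")
    form <- as.formula(paste("target_deathrate ~", rhs))
    x <- glm(form, data = full_data)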
I have panel data for 3 countries, ranging over 3 years. The dataset is called CarProduction:

Country  Year  cars  Fuel_price  PPP    Manufact  PublicTransport
USA      2015  500   5           10000  9         2
USA      2016  700   5.2         10500  8.75      2.2
USA      2017  780   5.4         11000  8.6       1.9
China    2015  150   9           4000   11        3
China    2016  200   8.6         4500   11.5      4
China    2017  340   9.4         6000   15.6      5
Italy    2015  200   9           4000   11        5
Italy    2016  300   8.6         4500   11.5      6.2
Italy    …
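The question is cut off above, but if the goal is to relate car production to the other columns while accounting for country and year, one baseline sketch would be fixed effects via factors (this is an assumption about the intended model, and with only nine rows it is illustrative only):

    CarProduction$Country <- factor(CarProduction$Country)
    fit <- glm(cars ~ Fuel_price + PPP + Manufact + PublicTransport
                      + Country + factor(Year),
               family = poisson, data = CarProduction)
    summary(fit)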
I'm studying the occurrence of Behavior11, Behavior12, Behavior2, and Behavior3 according to three variables: Times: task time; Time_interval: task time in intervals; Frequency: frequency of the task. For this purpose, I use GLM: attach(datas) an11 = anova(glm(Behavior11 ~ Times + Frequency, family = binomial), test = "Chisq") an12 = anova(glm(Behavior12 ~ Times_interval + Frequency, family = binomial), test = "Chisq") an3 = anova(glm(Behavior3 ~ Times_interval + Frequency, family = binomial), test = "Chisq") an2 = anova(glm(Behavior2 ~ Times_interval + Frequency, family = binomial), test = "Chisq") I get a different significant effect for every behavior. The odds value reveals the direction of the dependence. Example: model1 = glm(Behavior12 ~ Times_interval + Frequency, family = binomial) summary(model1) exp(model1$coefficients) Coefficients: Estimate …
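When reading the direction of each effect, it can help to exponentiate the coefficients together with their confidence intervals; a short sketch for model1 above:

    # odds ratios with 95% profile-likelihood confidence intervals
    exp(cbind(OR = coef(model1), confint(model1)))
    # OR > 1: the odds of the behavior rise with the predictor; OR < 1: they fall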
I am facing an issue where I have 7 sets of different variables/columns/predictors. I am trying to predict the same target variable, and I want to observe the importance/effect of all the sets in an ordered manner. (I am trying to use a ridge regression model for each of the 7 individual sets, since I want to keep all the variables, and I want to combine the output of these 7 models; each set has more than 20 …
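One way to put the seven sets in order is to fit one cross-validated ridge model per set and rank the sets by out-of-sample error; a sketch, where sets (a list of character vectors of column names), df, and the target column y are placeholder names:

    library(glmnet)
    cv_error <- sapply(sets, function(cols) {
      X <- as.matrix(df[, cols])
      fit <- cv.glmnet(X, df$y, alpha = 0)  # alpha = 0 -> ridge, keeps every variable
      min(fit$cvm)                          # best cross-validated error for this set
    })
    sort(cv_error)  # sets ordered from most to least predictive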