Regression model for continuous dependent variable and count independent variables

I am currently learning R and I am relatively inexperienced in the field. Hope I can get some advice from you guys!

I am working on a project where I have to estimate the average processing time of different work items (tasks).

I have the following panel data:

My sample size is n=2000 individual workers and T=10 time periods (each period is four weeks).

  • Independent variables: 51 different work items. I have count data for each work item (the number of times it is performed by each worker over a four-week period)

  • Dependent variable: total working hours of the worker (over a four-week period)

The goal of my analysis is to find the regression coefficients (which are estimates of the average completion time of each work item). I may also include other regressors (besides the work item counts), such as experience and age, in my model.

y = B0 + B1*X1 + ... + Bk*Xk + e

y: total working hours
X: count of each work item type

Issues:

So far, I have finished cleaning and processing the data and have performed some exploratory data analysis.

  1. Some work items have a lot of zeros (the work item is only performed once or twice by several workers in the time period).

  2. From the VIFs, I can see that there is imperfect multicollinearity among the independent variables. Some independent variables have a VIF of 5 to 6.
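
For reference, the VIF of regressor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that regressor on all the others. A rough sketch in Python with NumPy, assuming simulated data: the three count columns are hypothetical stand-ins, not the real work items, and the second one is built to overlap heavily with the first.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated (made-up) counts: item 2 almost always accompanies item 1
rng = np.random.default_rng(0)
n = 500
x1 = rng.poisson(5, n)
x2 = x1 + rng.poisson(1, n)   # strongly correlated with x1
x3 = rng.poisson(3, n)        # unrelated to the other two
X = np.column_stack([x1, x2, x3]).astype(float)

v = vif(X)
print(v)
```

A VIF near 1 means the regressor is nearly uncorrelated with the rest; larger values mean its coefficient's variance is inflated by that factor relative to the uncorrelated case.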

Questions:

  1. Any advice on how I should specify my model?

I looked at boxplots and eliminated the outliers of each regressor. I can see that some regressors are highly skewed (due to lots of zeros).

I also plotted each regressor against the total completion time to see if there is a linear relation. Some do look linear; others look more like a quadratic relation.

  2. Is there any way to deal with the multicollinearity aside from eliminating the regressors that have a high VIF? I ask because I need to estimate the coefficient of each work item.

  3. Should I set the intercept to 0? I know for sure that when ALL the regressors are 0 (all work item counts are 0), I should have zero total working hours.

I would also welcome any other advice for this problem. Thanks!


Since you want to retain all the predictors, try ridge regression, a regularization technique commonly used for multicollinearity problems like yours. It stabilizes the estimates by shrinking the coefficients.
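
A minimal sketch of the idea, written in Python with NumPy so that it stays self-contained (in R, `glmnet` with `alpha = 0` or `MASS::lm.ridge` fit ridge models). Everything here is an assumption for illustration: the data are simulated, the three work items and their "true" hours are made up, the penalty `lam = 10` is arbitrary, and the sketch omits the intercept and any standardization of the predictors.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: minimize ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Simulated (made-up) counts for three work items; item 2 overlaps heavily with item 1
rng = np.random.default_rng(0)
n = 500
x1 = rng.poisson(5, n)
x2 = x1 + rng.poisson(1, n)   # strongly correlated with x1
x3 = rng.poisson(3, n)
X = np.column_stack([x1, x2, x3]).astype(float)

true_hours = np.array([2.0, 1.0, 0.5])        # hypothetical hours per item
y = X @ true_hours + rng.normal(0.0, 1.0, n)  # total working hours plus noise

b_ols = ridge_fit(X, y, 0.0)     # lam = 0 reduces to ordinary least squares
b_ridge = ridge_fit(X, y, 10.0)  # lam > 0 shrinks the coefficients

print("OLS:  ", b_ols)
print("Ridge:", b_ridge)
```

In practice the penalty would be chosen by cross-validation. Because ridge estimates are biased toward zero, the shrunken coefficients should be read as stabilized rather than unbiased estimates of the average time per work item.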