Regression model for a continuous dependent variable and count independent variables
I am currently learning R and I am relatively inexperienced in the field. Hope I can get some advice from you guys!
I am working on a project where I have to estimate the average processing time of different work items (tasks).
I have the following panel data:
My sample size is n=2000 individual workers and T=10 (each time interval is a four-week period).
Independent variables: 51 different work items. I have count data for each work item (number of times it is performed by each worker over a four-week period).
Dependent variable: total working hours of the worker (over a four-week period).
The goal of my analysis is to find the regression coefficients (which are estimates of the average completion time of each work item). I may also include other regressors (besides the work item counts), such as experience and age, in my model.
y = B0 + B1*X1 + ... + Bk*Xk + e
y: total working hours
Xk: number of times work item type k was performed
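Written with worker and period indices (the subscripts are my own notation), the model I have in mind would be:

$$
y_{it} = \beta_0 + \sum_{k=1}^{51} \beta_k \, x_{kit} + \varepsilon_{it}, \qquad i = 1,\dots,2000, \quad t = 1,\dots,10
$$

Here y_{it} is the total working hours of worker i in period t, and x_{kit} is the number of times worker i performed work item k in period t, so each \beta_k is measured in hours per occurrence of item k.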
Issues:
So far, I have finished cleaning and processing the data and performed some exploratory data analysis.
Some work items have a lot of zeros (the work item is only performed once or twice by several workers in the time period).
From the VIFs, I can see that there is imperfect multicollinearity among the independent variables. Some independent variables have a VIF of 5 to 6.
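In case it helps, this is roughly how the VIFs can be computed by hand in base R. It is only a sketch: the data are simulated, and the names x1, x2, x3 and vif_manual are made up for illustration, not my real work items.

```r
# Simulated stand-in for the real data: three hypothetical work-item counts.
set.seed(1)
n  <- 500
x1 <- rpois(n, 5)
x2 <- x1 + rpois(n, 2)   # deliberately correlated with x1
x3 <- rpois(n, 0.2)      # mostly zeros, independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j
# on all the other regressors.
vif_manual <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

By this formula, a VIF of 5 means that 80% of the variance of that regressor is explained by the other regressors.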
Questions:
- Any advice on how I should specify my model?
I looked at boxplots and eliminated the outliers of each regressor. I see that some regressors are highly skewed (due to lots of zeros).
I also plotted each regressor against the total completion time to see if there is any linear relation. Some do look linear; others look more like a quadratic relation.
- Is there any way to deal with the multicollinearity aside from eliminating the regressors that have a high VIF? I ask because I need to estimate the coefficient of each work item.
- Should I set the intercept to 0? I know for sure that when all the regressors are 0 (no work items performed at all), total working hours should be zero.
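To make the intercept question concrete, these are the two specifications I am choosing between, sketched on simulated data. The item names (item_a, item_b) and the "true" times of 2 and 5 hours are invented for the example.

```r
# Simulated data: two hypothetical work items, no fixed overhead.
set.seed(42)
n <- 300
d <- data.frame(item_a = rpois(n, 4), item_b = rpois(n, 1))
# Assumed true times: 2 h per item_a, 5 h per item_b, plus noise.
d$hours <- 2 * d$item_a + 5 * d$item_b + rnorm(n, sd = 1)

fit_int   <- lm(hours ~ item_a + item_b, data = d)      # with intercept
fit_noint <- lm(hours ~ 0 + item_a + item_b, data = d)  # intercept forced to 0

# The "(Intercept)" row shows whether the estimated intercept differs from 0.
print(coef(summary(fit_int)))
print(coef(fit_noint))
```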
I would also welcome any other advice for this problem. Thanks!