Is zero-inflated negative binomial regression appropriate for this data? Am I interpreting it correctly?

I am evaluating whether governance predictor variables are associated with the prevalence of groundwater fecal contamination in a developing country context, as measured by TTC (thermotolerant coliform) counts per 100 mL of water. In my data, TTC is not normally distributed: there are many zeroes, and also many water sources with TTC of 125+ (our test kits could not measure TTC above this threshold). I ran countfit on TTC and various predictors, and it appeared to indicate that zero-inflated negative binomial regression (ZINB) was the appropriate model here.

My questions are:

1) Is zero-inflated negative binomial regression the right regression type to run with this data?

2) When Stata asks me to specify an inflation variable, should I specify ALL predictor variables, or only those that I expect might be driving the over-representation of zeroes in the dependent variable? For example, suppose my hypothesis is that there is a general association between some governance variable and TTC, but I ALSO think that TTC is affected by the type of water source, and that water source type might explain the zeroes (say, because deep, mechanically drilled wells rarely become contaminated regardless of governance, while shallow hand-dug wells are more susceptible to environmentally induced contamination). Should I specify only water source type as my inflation variable, or should I also specify the governance variables that I am testing as inflation variables?

3) When I interpret the output, am I correct in understanding that an incidence rate ratio (IRR) of 0.74 associated with a binary variable means that if the binary variable is present, predicted TTC = intercept * 0.74? (see image 5)

4) When I interpret the output, am I correct in understanding that an IRR of 0.99 associated with a continuous variable, such as the percentage of a community with sanitation access, means that for each 1% increase in sanitation access you multiply the intercept by 0.99, so that if sanitation access = 10%, predicted TTC = intercept * 0.99^10? (see image 5)

5) If I include a variable in both the main model and the logit model (I'm unsure whether that's actually permissible, see question 2), and the variable is significant in one of them but not both, how am I to interpret that? (see image 6)
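To make the arithmetic in questions 3 and 4 concrete, here is a minimal Python sketch of how I understand the multiplicative reading of IRRs, assuming the usual log link for the count part of the model. The baseline of 40 and the structural-zero probability of 0.3 are made-up numbers for illustration, not values from my output, and the function names are my own. Under this reading, "intercept" means the exponentiated constant (the predicted count when every predictor is zero), and the result is the mean of the count part only; the overall expected TTC would also be scaled down by the probability of a structural zero from the logit part.

```python
import math


def count_mean(intercept_coef, coefs, values):
    """Mean of the count (negative binomial) part: exp(b0 + sum(b_j * x_j))."""
    linear = intercept_coef + sum(b * x for b, x in zip(coefs, values))
    return math.exp(linear)


def overall_mean(count_mu, p_structural_zero):
    """Expected count after mixing in the always-zero group from the logit part."""
    return (1.0 - p_structural_zero) * count_mu


# Hypothetical values, NOT taken from the actual model output.
baseline = 40.0                # exp(intercept): count mean with all predictors at 0
b0 = math.log(baseline)
b_binary = math.log(0.74)      # coefficient whose IRR is 0.74
b_sanitation = math.log(0.99)  # coefficient whose IRR is 0.99 per 1% access

# Question 3: binary variable present, sanitation access at 0%.
mu_binary = count_mean(b0, [b_binary, b_sanitation], [1, 0])

# Question 4: binary variable absent, sanitation access at 10%.
mu_san10 = count_mean(b0, [b_binary, b_sanitation], [0, 10])

# Overall expectation if the logit part gave a 0.3 chance of a structural zero
# (0.3 is an assumed number for illustration).
mu_overall = overall_mean(mu_binary, 0.3)

print(mu_binary, mu_san10, mu_overall)
```

If this reading is right, the first value equals baseline * 0.74 and the second equals baseline * 0.99^10, with all other predictors held fixed at zero.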

