Handling near-duplicate observations in a regression / Bayesian model
I am working on a model where the underlying data is inherently correlated within groups, so some of my observations are almost, but not quite, duplicates of each other.
The setup is simple: I have a response variable y to predict from a discrete x variable plus several other potential predictors, which may or may not be significant. The observations are not independent; they are drawn from groups of underlying events, and I want to handle this properly.
I could keep only one observation per underlying group, but that would throw away a lot of data and statistical power. I would prefer a method that keeps all the data points but imposes some weighting scheme, so that an observation counts for less when it has near-duplicates from the same underlying event.
In particular, I'm fitting a Bayesian regression with PyMC3, so I could fudge the per-observation variances in my likelihood so the model fits less tightly to values that have many close duplicates from the same underlying group (see the sketch below).
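To make the idea concrete, here is a minimal sketch of what I mean (not my actual model): the toy data, the 1/(group size) weights, and all the variable names are just placeholders for illustration.

```python
import numpy as np
import pymc3 as pm

rng = np.random.default_rng(42)

# Toy data standing in for the real problem: 40 underlying events,
# each spawning 1-4 near-duplicate observations.
n_groups = 40
group_sizes = rng.integers(1, 5, size=n_groups)
group_idx = np.repeat(np.arange(n_groups), group_sizes)  # event label per row
n = len(group_idx)

x = rng.integers(0, 4, size=n_groups)[group_idx]                            # discrete predictor, shared within a group
z = rng.normal(size=n_groups)[group_idx] + rng.normal(scale=0.1, size=n)    # extra predictor, nearly duplicated within a group
y = 1.0 + 0.5 * x + 0.8 * z + rng.normal(scale=0.3, size=n)

X = np.column_stack([x, z])

# Down-weight rows that share an underlying event: w_i = 1 / (group size).
weights = 1.0 / np.bincount(group_idx)[group_idx]

with pm.Model() as weighted_model:
    intercept = pm.Normal("intercept", mu=0.0, sd=10.0)
    beta = pm.Normal("beta", mu=0.0, sd=10.0, shape=X.shape[1])
    sigma = pm.HalfNormal("sigma", sd=1.0)

    mu = intercept + pm.math.dot(X, beta)

    # "Fudge" the variance: inflate the noise sd for down-weighted rows,
    # so m near-duplicates carry roughly the information of one observation.
    pm.Normal("y_obs", mu=mu, sd=sigma / np.sqrt(weights), observed=y)

    trace = pm.sample(2000, tune=1000, target_accept=0.9)
```

The 1/sqrt(group size) scaling is just one guess; part of my question is whether this kind of likelihood weighting is even a reasonable way to express "these rows are not independent".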
Does anyone have experience with similar problems, and with good ways of dealing with them?
I think my model is currently overfitting badly because of high correlations between some of the predictors and the predictions.
Tags: collinearity, regression