Handling near-duplicate observations in a regression / Bayesian model

I am working on a model where the underlying data is inherently correlated by groups, so some of my observations are near duplicates of each other without being exact copies.

The problem itself is simple: I have a y variable to predict from a discrete x variable, plus several other potential predictors which may or may not turn out to be significant. The observations are not quite independent, because they are taken from groups of underlying events, and I want to handle this dependence better.

I could approach the problem by selecting only one observation from each underlying group, but that would throw away a lot of data and statistical power. I would rather keep all my data points and impose some weighting scheme that down-weights an observation when it has near-duplicates elsewhere, or when several observations come from the same underlying event.

In particular, I am using a Bayesian regression approach with pymc3, so I could adjust the variances in my likelihood to fit less tightly to values that have many close duplicates from the same underlying group.
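To make that concrete, this is roughly the kind of thing I have in mind, assuming a single predictor x and a per-observation count group_size giving the number of near-duplicate rows from the same underlying event (all the names here are placeholders, not my actual model):

import numpy as np
import pymc3 as pm

# x, y: observed predictor and response (1-D numpy arrays)
# group_size: for each row, how many rows come from the same underlying event
sigma_scale = np.sqrt(group_size)  # inflate the noise for heavily duplicated rows

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0, sd=10)
    beta = pm.Normal("beta", mu=0, sd=10)
    sigma = pm.HalfNormal("sigma", sd=1)

    mu = alpha + beta * x

    # Per-observation noise: rows with many near-duplicates get a wider
    # likelihood, so each one pulls on the posterior less strongly.
    y_obs = pm.Normal("y_obs", mu=mu, sd=sigma * sigma_scale, observed=y)

    trace = pm.sample(2000, tune=1000)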

Does anyone have experience with a similar problem, and suggestions for dealing with it?
I think my model is currently overfitting quite badly, partly because of high correlations among some of the predictors.

Tags: collinearity, regression



Some models are fine with duplicate data; for example, this is how oversampling works for unbalanced datasets, and Naive Bayes should also handle it. However, many other models will end up reporting inflated performance because of it. Duplicate rows cause problems when the same data appears in both the training and test sets: those rows give the model an artificially high accuracy, because the training set has already seen the answer. Exact duplicates can be found and removed using

import pandas as pd

df.duplicated()                   # boolean mask marking exact duplicate rows
df.drop_duplicates(inplace=True)  # drop exact duplicates, keeping the first of each

Finding similar but not identical rows is a bit more difficult. You can use these same two methods on a subset of the columns if you think that will catch more matches. Alternatively, you can run the data through a clustering algorithm (e.g. k-means) and either sample from the different clusters, or add a weight column based on the clustering output, for example 1 divided by the number of points in each cluster. You can then pass this weight vector into the model; a rough sketch of that weighting approach follows.
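A minimal sketch of the weighting idea, assuming scikit-learn and a purely numeric feature matrix; the column name "y", the number of clusters, and the final regression model are all placeholders, not a definitive implementation:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# df holds the predictors plus the response column "y" (names are placeholders)
X = df.drop(columns=["y"]).values
y = df["y"].values

# Cluster the rows so that near-duplicate observations land in the same cluster.
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
labels = kmeans.labels_

# Weight each row by 1 / (size of its cluster): heavily duplicated
# observations share their influence instead of multiplying it.
cluster_sizes = np.bincount(labels)
weights = 1.0 / cluster_sizes[labels]

# Pass the weights to any estimator that accepts sample_weight.
model = LinearRegression().fit(X, y, sample_weight=weights)

The same weight vector could equally be fed into a Bayesian model by scaling each observation's likelihood variance, along the lines of the question's sketch.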
