Logistic regression: shouldn't weighting by the number of instances give the same result? What could explain the discrepancy?

I am performing a logistic regression in a standard supervised setting (data set X, target y). X consists of a handful of categorical variables (which I one-hot encode), so it contains many duplicate rows (a few thousand unique rows out of millions of original rows). Given this redundancy, I was tempted to aggregate the duplicate rows, weight each unique row by its count during the fit, and get approximately the same result. However, I was surprised to see variations of ±30% in the final parameters of the model. I made sure to use the same encoding (column ordering) in both runs and disabled scikit-learn's default L2 regularisation.
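For concreteness, here is a minimal sketch of the two fits I am comparing, with toy data standing in for my real X (column names and sizes are illustrative; penalty=None requires scikit-learn >= 1.2, older versions spell it penalty='none'):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for my data: a few categorical columns, many duplicate rows.
df = pd.DataFrame({
    "a": rng.choice(["x", "y", "z"], size=100_000),
    "b": rng.choice(["u", "v"], size=100_000),
})
y = (rng.random(100_000) < 0.3).astype(int)

X = pd.get_dummies(df)  # one-hot encoding with a fixed column order

# Fit 1: all rows, no regularisation.
clf_full = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)

# Fit 2: aggregate identical (row, label) pairs and weight by their counts.
agg = pd.concat([X, pd.Series(y, name="y")], axis=1)
grouped = agg.groupby(list(agg.columns)).size().reset_index(name="count")
X_agg = grouped[X.columns]
y_agg = grouped["y"]
w = grouped["count"]

clf_agg = LogisticRegression(penalty=None, max_iter=1000).fit(
    X_agg, y_agg, sample_weight=w
)

# In exact arithmetic the two weighted likelihoods are identical, so I
# expected near-identical coefficients; instead I see sizeable differences.
print(clf_full.coef_)
print(clf_agg.coef_)
```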

Should I expect the same result? Could the difference come from the optimisation problem (the way the parameters are fitted)? How can I reduce this effect?

Topic: weighted-data logistic-regression scikit-learn

Category: Data Science
