What is the difference between data leakage and endogeneity?

Question


Tanguy

August 20, 2021, 21:56

I have the impression that the former term is used in ML whereas the latter is used in econometrics. Both carry the idea that information from the target is "leaking" into the explanatory variables.

Is there any difference between those two notions?

Topic data-leakage

Category Data Science

Statwonk · Accepted Answer · August 20, 2021, 21:56


No, data leakage is an example of simultaneity, which is a form of endogeneity.
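
To make "simultaneity" concrete, here is a minimal simulation sketch in plain Python. It assumes a hypothetical two-equation system, q = a*p + u and p = b*q + v, in which p and q determine each other; the function names, the coefficients a = -1 and b = 1, the sample size, and the seed are all illustrative choices, not anything from the answer above. Because p depends on q, p is correlated with the error term u, and a plain regression of q on p should not recover a.

```python
import random

def simulate_market(n=20000, a=-1.0, b=1.0, seed=0):
    """Draw (p, q) pairs from q = a*p + u and p = b*q + v, solved jointly."""
    rng = random.Random(seed)
    prices, quantities = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        # Reduced form: substitute one equation into the other and solve.
        q = (a * v + u) / (1 - a * b)
        p = (b * u + v) / (1 - a * b)
        prices.append(p)
        quantities.append(q)
    return prices, quantities

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var

prices, quantities = simulate_market()
slope = ols_slope(prices, quantities)
print(f"true a = -1.0, OLS slope of q on p = {slope:.3f}")
```

With these illustrative parameters the population slope works out to (a*var(v) + b*var(u)) / (b^2*var(u) + var(v)) = 0, so the estimate should sit near 0 rather than near the true a = -1, no matter how large the sample is.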

Sunny · Accepted Answer · March 27, 2019, 16:59

Endogeneity refers to explanatory variables that are correlated with the error term, for example because of an omitted variable or measurement error. Data leakage is the introduction of spurious explanatory power into the model by bringing in data other than the ground truth, such as synthetic data (say, generated with SMOTE). In the former case, the coefficient estimates for some features are biased, because part of their explanatory power has seeped into the error term. In the latter case, the model has high variance and fails to generalize, which makes it useless on production/test data. A typical way to introduce data leakage is SMOTING (yeah, using the word as a verb) before you do the validation split.
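
The oversample-before-split pitfall can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a real SMOTE implementation: plain duplication of minority rows stands in for SMOTE's interpolation, a hand-rolled 1-nearest-neighbour rule stands in for the model, and every function name, sample size, and seed here is hypothetical. The features are pure noise, so any apparent skill on the minority class comes from leakage.

```python
import random

def make_noise_data(n_major=200, n_minor=40, seed=0):
    """Two pure-noise features: the labels cannot truly be predicted."""
    rng = random.Random(seed)
    X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_major + n_minor)]
    y = [0] * n_major + [1] * n_minor
    return X, y

def oversample(X, y, rng):
    """Duplicate random minority rows until classes balance (stand-in for SMOTE)."""
    minority = [i for i, label in enumerate(y) if label == 1]
    n_extra = y.count(0) - y.count(1)
    picks = [rng.choice(minority) for _ in range(n_extra)]
    return X + [X[i] for i in picks], y + [1] * n_extra

def split(X, y, rng, test_frac=0.3):
    """Random train/test split; returns X_train, y_train, X_test, y_test."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(len(idx) * test_frac)
    test, train = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def predict_1nn(X_train, y_train, x):
    """Label of the training row closest to x (squared Euclidean distance)."""
    nearest = min(range(len(X_train)),
                  key=lambda i: (X_train[i][0] - x[0]) ** 2
                              + (X_train[i][1] - x[1]) ** 2)
    return y_train[nearest]

def minority_recall(X_train, y_train, X_test, y_test):
    """Share of minority test rows that the 1-NN rule labels as minority."""
    preds = [predict_1nn(X_train, y_train, x)
             for x, label in zip(X_test, y_test) if label == 1]
    return sum(preds) / len(preds)

def compare(seed=0):
    X, y = make_noise_data(seed=seed)
    rng = random.Random(seed + 1)
    # Leaky: oversample first, so copies of one row land on both sides of the split.
    leaky = minority_recall(*split(*oversample(X, y, rng), rng))
    # Clean: split first, then oversample the training part only.
    X_tr, y_tr, X_te, y_te = split(X, y, rng)
    X_tr, y_tr = oversample(X_tr, y_tr, rng)
    clean = minority_recall(X_tr, y_tr, X_te, y_te)
    return leaky, clean

leaky, clean = compare()
print(f"minority recall, oversample before split: {leaky:.2f}")
print(f"minority recall, split before oversample: {clean:.2f}")
```

In the leaky ordering, most minority rows in the validation set have an exact copy in the training set, so the score looks far better than anything the model could achieve on genuinely new data. Splitting first keeps the validation rows untouched by the resampling step.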
