What is the difference between data leakage and endogeneity?

I have the impression that the former is used in ML, whereas the latter is used in econometrics. Both carry the idea that information from the target is "leaking" into the explanatory variables.

Is there any difference between those two notions?



No fundamental difference: data leakage is an example of simultaneity, which is itself a form of endogeneity.


Endogeneity refers to explanatory variables that are correlated with the error term, for example because of an omitted variable, measurement error, or simultaneity. Data leakage is the introduction of spurious explanatory power into the model because information that will not be available at prediction time gets into training, such as validation rows or synthetic data derived from them (say we used SMOTE). In the former case, the coefficient estimates for some features are biased, because part of what explains the target sits in the error term and gets attributed to those features. In the latter case, the validation scores look better than they should and the model generalizes poorly, which makes it useless on production/test data. A typical way to introduce data leakage is SMOTING (yeah, using the word as a verb) before you do the validation split.
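To make the contrast concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not anyone's reference implementation. The first function simulates an omitted variable `z` and shows the regression slope on `x` drifting away from its true value of 1.0. The second compares oversampling before and after the validation split on pure-noise data. Note that plain random duplication stands in for SMOTE here (SMOTE interpolates between neighbours rather than copying points), and the 1-nearest-neighbour classifier, the sample sizes, and all function names are choices made for this example.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def omitted_variable_slope(n=20000, seed=0):
    """Endogeneity: regress y on x while leaving out the confounder z."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # omitted variable
        x = z + rng.gauss(0, 1)                  # regressor correlated with z
        y = 1.0 * x + 1.0 * z + rng.gauss(0, 1)  # true effect of x is 1.0
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)


def nearest_label(point, train):
    """1-nearest-neighbour prediction with squared Euclidean distance."""
    closest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )
    return closest[1]


def split(rows, train_fraction=0.7):
    """Split a list into a training part and a validation part."""
    cut = int(train_fraction * len(rows))
    return rows[:cut], rows[cut:]


def minority_recall(oversample_before_split, seed=0):
    """Validation recall on the minority class; features are pure noise."""
    rng = random.Random(seed)
    majority = [((rng.random(), rng.random()), 0) for _ in range(300)]
    minority = [((rng.random(), rng.random()), 1) for _ in range(60)]

    if oversample_before_split:
        # Leaky: copies are made first, so duplicates of the same point
        # can end up on both sides of the split.
        minority = minority + [rng.choice(minority) for _ in range(240)]
        rng.shuffle(minority)

    train_maj, valid_maj = split(majority)
    train_min, valid_min = split(minority)

    if not oversample_before_split:
        # Correct: oversample only inside the training fold.
        extra = len(train_maj) - len(train_min)
        train_min = train_min + [rng.choice(train_min) for _ in range(extra)]

    train = train_maj + train_min
    hits = sum(nearest_label(point, train) == 1 for point, _ in valid_min)
    return hits / len(valid_min)


if __name__ == "__main__":
    # Analytically the biased slope is cov(x, y) / var(x) = 3 / 2.
    print("slope with z omitted:", omitted_variable_slope())
    print("recall, oversample before split:", minority_recall(True))
    print("recall, oversample after split:", minority_recall(False))
```

Since the features carry no signal, an honest validation recall should sit near chance. The leaky variant scores much higher only because validation points have exact copies in the training set, which is the spurious explanatory power described above. The omitted-variable case is different in kind: the slope estimate itself is wrong (about 1.5 instead of 1.0), no matter how the data is split.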
