Removing duplicate records before training

I am currently working on a project classifying text into classes; specifically, mapping job titles to industry codes. For example, "McDonalds Employee" might be classified as 11203 (there are a few hundred classes in the problem). We are using FastText for this.
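For context, this is roughly how such a model is trained with the `fasttext` Python package; the file name and hyperparameters below are illustrative, not our actual settings:

```python
import fasttext

# fastText's supervised mode expects one example per line, formatted as
# "__label__<class> <text>", e.g. "__label__11203 mcdonalds employee".
model = fasttext.train_supervised(
    input="titles.train",  # hypothetical training file in the above format
    epoch=25,
    lr=0.5,
    wordNgrams=2,
)

# predict() returns the top label(s) and their probabilities
labels, probs = model.predict("mcdonalds employee")
print(labels[0], probs[0])
```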

The person I am working with insists on removing duplicate records from the data before training our model. That is, we might see 100 records with "McDonalds Employee" and class 11203, and he wants to remove all but one of them. His argument is that not doing so could result in overfitting and an optimistic error rate, since the same records will appear across the train/test/validation splits. My counter is that I expect to see (many) records with "McDonalds Employee" in our future data, and I would like to know how the model will do at predicting them, so we would not arrive at an optimistic error rate but a properly calculated one. Secondly, if our data for some reason contains one record "McDonalds Employee" with class 24444, removing duplicates leaves that mislabeled record with the same weight as the single remaining correct one: all the frequency evidence that the correct code is 11203 is thrown away.
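For what it's worth, his leakage concern can be addressed without deduplicating: split on the unique title rather than on the record, so identical texts never straddle the train/test boundary. A minimal sketch with scikit-learn's `GroupShuffleSplit` (the file and column names are assumptions about our data, not actual ones):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Assumed schema: one record per row, columns "title" and "code".
df = pd.read_csv("jobs.csv")  # hypothetical file name

# Grouping by the text itself keeps every copy of a title on one side of
# the split, so duplicates cannot leak between train and test, while the
# label frequencies (100 x 11203 vs. 1 x 24444) stay intact for training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["title"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
```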

I have read other posts here that suggest removing duplicates is not correct, but I have yet to see an actual source in the literature stating this. Since I have to convince a colleague, my question is twofold: does anyone know of a reference in the literature that supports keeping duplicates? And is there any reason to remove duplicates that is specific to FastText? I admit I am not very familiar with NLP or FastText (or even neural networks in general), so it is possible there is some reason to remove them when training a model of this type.

Tags: fasttext, overfitting, nlp

Category: Data Science
