Is it good practice to remove the numeric values from the text data during preprocessing?

I'm preprocessing a text dataset that contains several kinds of numeric values, such as:

  • date (1st July)
  • year (2019)
  • tentative values (3-5 years / 10+ advantages)
  • unique values (room no 31 / user rank 45)
  • percentage (100%)

Is it recommended to discard these numerics before creating a vectorizer (BoW/TF-IDF) for model (classification/regression) development?

Any quick help on this is much appreciated. Thank you.

Topic bag-of-words hashingvectorizer tokenization tfidf nlp

Category Data Science


To build on Prashant's answer: it depends on your problem. If those values are important to your task, you might extract them and append them to the end of your feature set (similar to [this question asked here], which combines multiple kinds of data in a regression problem).
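The extract-and-append idea can be sketched like this: pull numbers out of each document with a regex and turn them into a few extra feature columns to sit alongside the text vector. The function name, feature names, and summaries chosen here are illustrative, not a prescribed recipe.

```python
import re

# Matches integers and simple decimals like "31" or "3.5".
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def extract_numeric_features(text):
    """Return simple numeric summaries of a document, which could be
    appended to its BoW/TF-IDF vector as extra feature columns."""
    numbers = [float(n) for n in NUM_RE.findall(text)]
    return {
        "num_count": len(numbers),                      # how many numbers appear
        "num_max": max(numbers) if numbers else 0.0,    # largest value seen
        "has_percent": int("%" in text),                # crude percentage flag
    }

feats = extract_numeric_features("room no 31, user rank 45, growth 100%")
# feats -> {'num_count': 3, 'num_max': 100.0, 'has_percent': 1}
```

Whether summaries like these help is task-dependent, which is exactly the point of the answer above: keep the numbers only if they plausibly carry signal for your target.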

An easy thing to do (and probably the right call the majority of the time) is to simply remove all those numbers. Another strategy I've seen elsewhere is to use rules to convert the different numbers into their "type": 2019 used as a year would be replaced with a token like #YEAR, 100% replaced by #PERCENT, and so on.
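A minimal sketch of that rule-based "number typing", applied before vectorization. The rules fire in order, from most specific to the generic fallback; the token names and patterns are illustrative assumptions, and real data would need more careful rules.

```python
import re

# Ordered rules: more specific patterns first, generic #NUMBER fallback last.
RULES = [
    (re.compile(r"\b\d+(?:\.\d+)?%"), "#PERCENT"),      # 100%
    (re.compile(r"\b(?:19|20)\d{2}\b"), "#YEAR"),       # 2019
    (re.compile(r"\b\d+\s*-\s*\d+\b"), "#RANGE"),       # 3-5 (years)
    (re.compile(r"\b\d+(?:st|nd|rd|th)\b"), "#ORDINAL"),# 1st
    (re.compile(r"\b\d+\b"), "#NUMBER"),                # anything left: room no 31
]

def type_numbers(text):
    """Replace raw numerics with coarse type tokens so BoW/TF-IDF
    groups them instead of treating each value as a unique term."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text

print(type_numbers("1st July 2019: growth of 100% over 3-5 years, room no 31"))
# -> "#ORDINAL July #YEAR: growth of #PERCENT over #RANGE years, room no #NUMBER"
```

This keeps the information that "a year" or "a percentage" was present while collapsing the unbounded vocabulary of raw values into a handful of tokens the vectorizer can actually learn from.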


Is it recommended to discard these numerics before creating a vectorizer (BoW/TF-IDF) for model (classification/regression) development?

It depends on the problem statement. For example, a year could be significant if you want to capture a trend and the year takes many unique values; but if it is constant across the dataset, you can safely remove it.

To add to that: if you are doing sentiment analysis, numeric values usually don't carry much signal.
