How to process hyphenated English words for an NLP problem?

I'm preprocessing an English text dataset and I encounter hyphenated words like 'well-known'. Would it be more useful to

  • remove the hyphen as a special character and treat the result as a single word, 'wellknown', or
  • split the word into two, 'well' and 'known', or
  • use all three words, 'well', 'known' and 'wellknown', in the vector creation (BOW/TF-IDF) process for model input?

Any quick help on this would be much appreciated. Thank you.

Topic bag-of-words tokenization tfidf preprocessing nlp

Category Data Science


Options 1 and 3 would be fine. Splitting "well-known" into "well" and "known" is not a good idea, because you lose information and/or end up with erroneous or unhelpful counts.


I agree with Nicholas' answer, a few more thoughts:

  • You could use a standard English tokenizer (e.g. NLTK, spaCy), if only to see how it processes hyphenated words. Similarly, you could check how it's done in a pre-tokenized dataset, but be aware that tokenization conventions may differ from one dataset to another.
  • IMHO the choice depends on the task/application and to some extent on the size of the data: if the data is large, option 1 is probably preferable because there's a good chance that the same hyphenated word appears several times. However, if the data is small, option 2 is better since it allows partial matches with the individual tokens.
  • Option 3 is an interesting compromise between options 1 and 2, but it has the disadvantage of slightly distorting the word distribution, since each hyphenated word is counted several times.
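To compare the three options from the question side by side, here is a minimal sketch using only the standard library. The function name `tokenize`, the `mode` values and the regular expression are assumptions made for illustration; a real pipeline would more likely plug this logic into an existing tokenizer.

```python
import re

# Runs of letters/digits, optionally joined by single hyphens (e.g. "well-known").
# This pattern is an illustrative assumption, not a complete English tokenizer.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text, mode="join"):
    """Tokenize text, handling hyphenated words according to `mode`.

    mode="join"  -> option 1: 'well-known' becomes 'wellknown'
    mode="split" -> option 2: 'well-known' becomes 'well', 'known'
    mode="both"  -> option 3: 'well', 'known', 'wellknown'
    """
    if mode not in ("join", "split", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    tokens = []
    for word in WORD_RE.findall(text.lower()):
        if "-" not in word:
            tokens.append(word)
            continue
        parts = word.split("-")
        if mode in ("split", "both"):
            tokens.extend(parts)
        if mode in ("join", "both"):
            tokens.append("".join(parts))
    return tokens

sentence = "A well-known author"
for mode in ("join", "split", "both"):
    print(mode, tokenize(sentence, mode))
```

With `mode="both"` every hyphenated word contributes three tokens instead of one, which is the distortion of the word distribution mentioned above.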

They all sound like interesting approaches. I think the first one is better because it allows unseen hyphenated words to be somewhat understood (e.g. well + known ~= well-known).

For a TF-IDF BOW model, you might get good performance from any of the above.

For a model that is sensitive to word order I would certainly go with the first option and might tokenise the text so that I had a token to represent the hyphen too.
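One possible reading of "a token to represent the hyphen" is to emit the hyphen as a token of its own between the two parts, so an order-sensitive model can still tell 'well-known' apart from 'well known'. The sketch below assumes that reading; the name `tokenize_with_hyphen` and the pattern are hypothetical.

```python
import re

# Either a run of letters/digits or a single hyphen.
# Note: under this simple pattern a free-standing hyphen also becomes a token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|-")

def tokenize_with_hyphen(text):
    """Split hyphenated words into their parts and keep '-' as its own token."""
    return TOKEN_RE.findall(text.lower())

print(tokenize_with_hyphen("A well-known author"))
```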
