How to process hyphenated English words for any NLP problem?

Question

emily

September 10, 2020, 12:25

I'm preprocessing an English text dataset and I encounter hyphenated words like 'well-known'. Would it be useful to:

1. remove the hyphen as a special character and treat the result as a single word, 'wellknown', or
2. separate the word into two, 'well' and 'known', or
3. use all three words, 'well', 'known', and 'wellknown', in the vector creation (BOW/TF-IDF) process for model input?

Any quick help on this would be much appreciated. Thank you.

Topic bag-of-words tokenization tfidf preprocessing nlp

Category Data Science

bonez001 · Accepted Answer · September 10, 2020, 12:25

Options 1 and 3 would be fine. Separating "well-known" into "well" and "known" would not be a good idea, because you lose information and/or end up with erroneous or unhelpful counts.
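
To make the three options from the question concrete, here is a minimal Python sketch rather than a definitive implementation. The function names and the plain whitespace tokenization are assumptions made for illustration; a real pipeline would also need to handle punctuation.

```python
import re

# Matches a hyphen only when it sits between two word characters,
# so a free-standing dash (" - ") is left alone.
INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")

def option1_join(text):
    """Option 1: drop the hyphen and keep one merged token ('wellknown')."""
    return INTRA_WORD_HYPHEN.sub("", text).lower().split()

def option2_split(text):
    """Option 2: split the hyphenated word into its parts ('well', 'known')."""
    return INTRA_WORD_HYPHEN.sub(" ", text).lower().split()

def option3_both(text):
    """Option 3: keep the parts and the merged form together."""
    tokens = []
    for raw in text.lower().split():
        parts = raw.split("-")
        if len(parts) > 1 and all(parts):
            tokens.extend(parts)            # 'well', 'known'
            tokens.append("".join(parts))   # 'wellknown'
        else:
            tokens.append(raw)
    return tokens

sentence = "A well-known author"
print(option1_join(sentence))  # ['a', 'wellknown', 'author']
print(option2_split(sentence)) # ['a', 'well', 'known', 'author']
print(option3_both(sentence))  # ['a', 'well', 'known', 'wellknown', 'author']
```

Note that option 3 turns one surface word into three tokens, so any hyphenated word contributes three counts to a BOW/TF-IDF vector.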

Erwan · Accepted Answer · September 1, 2020, 14:09

I agree with Nicholas' answer. A few more thoughts:

You could use a standard English tokenizer (e.g. NLTK, spaCy), if only to see how it processes hyphenated words. Similarly, you could check how it's done in a pre-tokenized dataset, but be aware that tokenization conventions can differ from one dataset to another.
IMHO the choice depends on the task/application and, to some extent, on the size of the data: if the data is large, option 1 is probably preferable because the same hyphenated word is likely to appear several times. If the data is small, option 2 is better since it allows partial matches with the individual tokens.
Option 3 is an interesting compromise between options 1 and 2, but it has the disadvantage of slightly distorting the word distribution, since each hyphenated word is counted more than once.
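
As one way of checking what a standard tool does by default, the sketch below applies the default `token_pattern` documented for scikit-learn's `CountVectorizer` and `TfidfVectorizer` using only the `re` module. This swaps in scikit-learn's regex rather than NLTK or spaCy, whose rules differ. The second pattern and both function names are hypothetical, shown only to illustrate how hyphenated words could be kept whole.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# runs of two or more word characters, so a hyphen acts as a separator.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical alternative: allow hyphen-joined parts inside one token.
HYPHEN_KEEPING_PATTERN = re.compile(r"(?u)\b\w+(?:-\w+)*\b")

def default_like_tokens(text):
    """Tokenize the way the default pattern would (behaves like option 2)."""
    return DEFAULT_PATTERN.findall(text.lower())

def hyphen_keeping_tokens(text):
    """Tokenize while keeping hyphenated words as single tokens."""
    return HYPHEN_KEEPING_PATTERN.findall(text.lower())

text = "A well-known, state-of-the-art model"
print(default_like_tokens(text))
# ['well', 'known', 'state', 'of', 'the', 'art', 'model']
print(hyphen_keeping_tokens(text))
# ['a', 'well-known', 'state-of-the-art', 'model']
```

With the default pattern the single-character token "a" is dropped and every hyphenated word is split, so if option 1 is wanted the hyphens have to be handled before vectorizing, or a custom pattern has to be supplied.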

Nicholas James Bailey · Accepted Answer · September 1, 2020, 13:15

They all sound like interesting approaches. I think the first one is better, because it allows unseen hyphenated words to be somewhat understood (e.g. well + known ~= well-known).

For a TF-IDF BOW model, you might get good performance from any of the above.

For a model that is sensitive to word order I would certainly go with the first option and might tokenise the text so that I had a token to represent the hyphen too.
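
One possible reading of "a token to represent the hyphen" is sketched below. The function name and the choice of "-" as the hyphen token are assumptions for illustration, and tokenization is again plain whitespace splitting.

```python
import re

# Hyphen between two word characters only; free-standing dashes are untouched.
INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")

def tokenize_with_hyphen(text, hyphen_token="-"):
    """Split hyphenated words but keep the hyphen as its own token,
    so a sequence model still sees that the parts were joined."""
    spaced = INTRA_WORD_HYPHEN.sub(f" {hyphen_token} ", text.lower())
    return spaced.split()

print(tokenize_with_hyphen("A well-known author"))
# ['a', 'well', '-', 'known', 'author']
```

Because token order is preserved, an order-sensitive model can learn that "well", "-", "known" in sequence behaves differently from "well" and "known" appearing separately.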
