Tokenize text with both American and British English spellings

I need to tokenize a corpus of abstracts from an international conference. The abstracts are usually in American English, but sometimes in British English.

Consequently, I get two tokens for the same word: “organization” and “organisation”, or “color” and “colour”. Examples: https://en.oxforddictionaries.com/spelling/british-and-spelling

Do you know a (Python) library that converts British English to American English (or vice versa)?

I would be happy to do that ... (but I am French and my English is not so good)

Thanks.

Tags: text-filter, nltk, text-mining

Category: Data Science


Grouping related token variants like these is called text normalization.

There is no established Python package that converts between British and American spellings out of the box. You could build a custom dictionary mapping one spelling convention to the other, or write a function that rewrites the tokens before counting them.
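A minimal sketch of the dictionary approach might look like this. The mapping here is a hypothetical hand-built sample with only a few entries; in practice you would populate it from a full British/American word list such as the one linked in the question.

```python
# Hypothetical sample mapping; extend from a full British/American word list.
UK_TO_US = {
    "organisation": "organization",
    "colour": "color",
    "analyse": "analyze",
    "centre": "center",
}

def normalize_tokens(tokens):
    """Lowercase tokens and rewrite British spellings to American ones."""
    return [UK_TO_US.get(tok.lower(), tok.lower()) for tok in tokens]

print(normalize_tokens(["Colour", "and", "organisation"]))
# ['color', 'and', 'organization']
```

Applying this step after tokenization means “colour” and “color” end up as the same token, so they are counted together in the corpus.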
