Tokenize text with both American and British English spellings

I need to tokenize a corpus of abstracts from an international conference. The abstracts are usually in American English, but sometimes in British English.

Consequently, I get two tokens for the same word: “organization” and “organisation”, or “color” and “colour”. Examples: https://en.oxforddictionaries.com/spelling/british-and-spelling

Do you know a (Python) library that converts British English to American English (or vice versa)?

I would be happy to do that ... (but I am French and my English is not so good)

Thanks.

Tags: text-filter, nltk, text-mining

Category: Data Science


Grouping related token variants like these is called text normalization.

There is no established Python package that converts between British and American spellings out of the box. You could build a custom dictionary mapping one spelling convention to the other, or write a function that rewrites the tokens before counting them.
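A minimal sketch of the dictionary approach might look like this. The mapping here is a hypothetical hand-built sample with only a few entries; in practice you would populate it from a full British/American word list such as the one linked in the question.

```python
# Hypothetical sample mapping; extend from a full British/American word list.
UK_TO_US = {
    "organisation": "organization",
    "colour": "color",
    "analyse": "analyze",
    "centre": "center",
}

def normalize_tokens(tokens):
    """Lowercase tokens and rewrite British spellings to American ones."""
    return [UK_TO_US.get(tok.lower(), tok.lower()) for tok in tokens]

print(normalize_tokens(["Colour", "and", "organisation"]))
# ['color', 'and', 'organization']
```

Applying this step after tokenization means “colour” and “color” end up as the same token, so they are counted together in the corpus.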
