What machine learning algorithms to use for unsupervised POS tagging?

I am interested in an unsupervised approach to training a POS-tagger.

Labeling is very difficult, and I would like to test a tagger for my specific domain (chats), where users typically write in lowercase, etc. If it matters, the data is mostly in German.

I have read about older techniques such as HMMs, but maybe there are newer and better approaches?

Topic unsupervised-learning parsing nlp machine-learning

Category Data Science


I would be very interested to hear what you need a tagger for in the context of chatbots.

Maybe you just need a stemmer, to produce the base form of an inflected word?

In that case, you can check this.


There is no genuinely unsupervised method for POS tagging. Parts of speech are categories that we infer ourselves, using rules defined by the specific language being tagged. There is no mathematical notion of a part of speech that could be derived from raw text without some predefined, empirically established rules, which is why no method is genuinely unsupervised.

A weakly supervised approach: estimate the hidden-state parameters of an HMM using the Baum-Welch algorithm.
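
As a rough illustration of the Baum-Welch idea, here is a minimal pure-Python sketch that fits a discrete HMM to unlabeled sequences of word ids. The function names (`normalize`, `baum_welch`), the random initialization, the small smoothing constant, and the toy German corpus are all assumptions made for this example, not part of any particular library. Note that the learned hidden states are anonymous clusters; mapping them to named tags such as NOUN or VERB still needs a dictionary or a few labeled examples.

```python
import math
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def baum_welch(sequences, n_states, n_symbols, n_iter=20, seed=0):
    """Fit a discrete HMM to unlabeled sequences of symbol ids with EM."""
    rng = random.Random(seed)
    states = range(n_states)
    # Random starting point (assumption: any strictly positive init works).
    start = normalize([rng.random() + 0.1 for _ in states])
    trans = [normalize([rng.random() + 0.1 for _ in states]) for _ in states]
    emit = [normalize([rng.random() + 0.1 for _ in range(n_symbols)])
            for _ in states]
    log_likelihoods = []

    for _ in range(n_iter):
        # Expected counts, with tiny smoothing to avoid zero probabilities.
        start_c = [1e-6] * n_states
        trans_c = [[1e-6] * n_states for _ in states]
        emit_c = [[1e-6] * n_symbols for _ in states]
        total_ll = 0.0

        for obs in sequences:
            if not obs:
                continue
            T = len(obs)
            # Forward pass, rescaled at every step to avoid underflow.
            alpha, scale = [], []
            for t in range(T):
                if t == 0:
                    row = [start[i] * emit[i][obs[0]] for i in states]
                else:
                    row = [sum(alpha[t - 1][i] * trans[i][j] for i in states)
                           * emit[j][obs[t]] for j in states]
                c = sum(row)
                alpha.append([v / c for v in row])
                scale.append(c)
            total_ll += sum(math.log(c) for c in scale)

            # Backward pass, using the same scaling factors.
            beta = [[1.0] * n_states for _ in range(T)]
            for t in range(T - 2, -1, -1):
                for i in states:
                    beta[t][i] = sum(
                        trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                        for j in states) / scale[t + 1]

            # E-step: accumulate expected state and transition counts.
            for t in range(T):
                for i in states:
                    gamma = alpha[t][i] * beta[t][i]
                    emit_c[i][obs[t]] += gamma
                    if t == 0:
                        start_c[i] += gamma
                    if t < T - 1:
                        for j in states:
                            trans_c[i][j] += (
                                alpha[t][i] * trans[i][j]
                                * emit[j][obs[t + 1]] * beta[t + 1][j]
                                / scale[t + 1])

        # M-step: re-estimate the parameters from the expected counts.
        start = normalize(start_c)
        trans = [normalize(row) for row in trans_c]
        emit = [normalize(row) for row in emit_c]
        log_likelihoods.append(total_ll)

    return start, trans, emit, log_likelihoods


# Toy usage: a hypothetical, tiny lowercase chat corpus.
chats = [["ich", "gehe", "heim"], ["du", "gehst", "heim"],
         ["ich", "bleibe", "hier"]]
vocab = {w: i for i, w in enumerate(sorted({w for s in chats for w in s}))}
encoded = [[vocab[w] for w in s] for s in chats]
start, trans, emit, lls = baum_welch(encoded, n_states=3, n_symbols=len(vocab))
```

The returned `lls` list holds the data log-likelihood before each parameter update, which is a convenient way to check that training is converging. On real chat data a much larger corpus and several random restarts would likely be needed, since EM only finds a local optimum.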

Another is to implement a maximum entropy model with beam search decoding, using empirically established rules (hence, not truly unsupervised).
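
The beam search part of that idea can be sketched independently of the model. In the minimal example below, `beam_search` keeps only the best few partial tag sequences at each word. The scorer `toy_log_prob` is a hypothetical, hand-set stand-in for a trained maximum entropy model (which would compute log P(tag | features) from learned weights); the tag set and the tiny lexicon are assumptions for illustration only.

```python
import math

TAGS = ["PRON", "VERB", "ADV"]  # assumed toy tag set


def toy_log_prob(prev_tag, word, tag):
    """Hypothetical scorer standing in for a trained MaxEnt model.

    A real model would also use prev_tag and other context features;
    this stand-in only looks the word up in a tiny hand-written lexicon.
    """
    lexicon = {"ich": "PRON", "gehe": "VERB", "heim": "ADV"}
    if word not in lexicon:
        return math.log(1.0 / len(TAGS))  # unknown word: uniform
    return math.log(0.8 if lexicon[word] == tag else 0.1)


def beam_search(words, tags, log_prob, beam_width=3):
    """Return the best-scoring tag sequence found and its log-score."""
    beam = [(0.0, [])]  # (cumulative log-probability, tags chosen so far)
    for word in words:
        candidates = []
        for score, history in beam:
            prev_tag = history[-1] if history else "<s>"
            for tag in tags:
                candidates.append(
                    (score + log_prob(prev_tag, word, tag), history + [tag]))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    best_score, best_tags = beam[0]
    return best_tags, best_score


best_tags, best_score = beam_search(["ich", "gehe", "heim"], TAGS, toy_log_prob)
```

A narrow beam trades accuracy for speed: with `beam_width=1` this becomes greedy decoding, while a very wide beam approaches exhaustive search over all tag sequences.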


Fortunately, you don't need unsupervised methods for POS tagging in most languages, and especially not for German. There are semi-supervised or "weakly" supervised methods, such as the older HMM/EM approaches already mentioned. There is also a newer solution based on error-correcting output code classification: "Weakly supervised POS tagging without disambiguation".

Of course, the accuracy of fully supervised methods such as LSTMs is far better than that of semi-supervised ones, but because of the known drawbacks of fully supervised methods (e.g. a lot of manual annotation work), people still look for cheaper approaches. Excellent accuracy always comes at a higher cost.


There are no unsupervised methods for training a POS tagger that perform comparably to human annotation or to supervised methods.

The current state-of-the-art supervised methods for training POS taggers are long short-term memory (LSTM) neural networks.
