An algorithm for Automatic Tag Clustering

Our website dinf is somewhat like StackExchange: people submit short definitions of concepts. We would like to automatically assign those concepts to 'Topics'. The problem is that dinf limits every definition to a maximum of 500 characters by default. Which algorithm or module can we use to assign those concepts, assuming that all topics are known in advance?

Topic word2vec knowledge-base nlp clustering

Category Data Science


The limited length is not much of an issue in itself. However, the standard way to do this is supervised classification (decision trees, for instance, though there are many options), which means training a model on a large enough set of annotated instances.
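
To make the supervised route concrete, here is a minimal sketch using only the Python standard library. It uses multinomial Naive Bayes with add-one smoothing rather than a decision tree, simply because it fits in a few lines; the topic names and the four training definitions are made up for illustration, and a real training set would need far more annotated examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Collect per-topic document and word counts from (text, topic) pairs."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, topic in examples:
        topic_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocabulary.add(word)
    return topic_counts, word_counts, vocabulary


def predict(model, text):
    """Return the topic with the highest log-posterior for the text."""
    topic_counts, word_counts, vocabulary = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, count in topic_counts.items():
        score = math.log(count / total_docs)  # log prior of the topic
        # Add-one (Laplace) smoothing over the training vocabulary
        denom = sum(word_counts[topic].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[topic][word] + 1) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical annotated definitions (assumption, not data from dinf)
TRAINING = [
    ("A neural network is a model made of layers of weighted units", "machine-learning"),
    ("Gradient descent minimizes a loss function by iterative updates", "machine-learning"),
    ("A primary key uniquely identifies a row in a database table", "databases"),
    ("An index speeds up queries on a database table", "databases"),
]

model = train(TRAINING)
print(predict(model, "A foreign key links a row to another table"))
```

With a library such as scikit-learn the same idea would be a vectorizer followed by a classifier; the hand-written version is only meant to show that 500-character texts still carry enough words to learn from.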

If obtaining a training set is not possible, you could try a similarity-based approach that compares every instance against a set of words representing each topic. This is unlikely to work as well as a supervised method trained specifically on this kind of data. Also, you would still have to evaluate the predicted topics on an annotated test set.
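
A rough sketch of the similarity-based fallback, again in plain Python: each topic is described by a hand-picked word list, each definition is scored by cosine similarity between word counts, and the result is checked against a small annotated test set. The word lists, topic names, and test items below are all hypothetical; word vectors such as word2vec could replace the raw counts, which this sketch does not attempt.

```python
import math
import re
from collections import Counter

# Hypothetical topic vocabularies (assumption); in practice curated by hand
TOPIC_WORDS = {
    "machine-learning": ["model", "training", "neural", "network", "loss", "classifier"],
    "databases": ["table", "row", "query", "index", "key", "sql"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_topics(text, topic_words=TOPIC_WORDS):
    """Return (topic, score) pairs sorted from most to least similar."""
    vector = Counter(tokenize(text))
    scores = {
        topic: cosine(vector, Counter(words))
        for topic, words in topic_words.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def accuracy(annotated, topic_words=TOPIC_WORDS):
    """Fraction of annotated (text, topic) pairs whose top-ranked topic is correct."""
    correct = sum(
        1 for text, gold in annotated
        if rank_topics(text, topic_words)[0][0] == gold
    )
    return correct / len(annotated)


# Hypothetical annotated test set (assumption), needed to judge the predictions
TEST_SET = [
    ("A primary key uniquely identifies a row in a table", "databases"),
    ("A classifier is a model fitted on training data", "machine-learning"),
]

print(rank_topics("An index speeds up a query on a table"))
print(accuracy(TEST_SET))
```

A definition that shares no word with any topic list gets a score of zero everywhere, which is one reason this approach tends to lag behind a trained classifier.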
