How to cluster a Twitter dataset?

I have a Twitter dataset and I want to extract the topics discussed in it. So, I decided to group my tweets into clusters using an unsupervised machine learning algorithm such as k-means. I made this choice because of the time-consuming training process of supervised approaches.

So, as a first step after cleaning my tweets, I will extract features (e.g. hashtags) from them and enrich them with side information from knowledge bases (e.g. Wikipedia). Secondly, the tweets will be represented in a vector space. Next, using k-means with a given K=6, the enriched tweets will be grouped into 6 clusters.
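A minimal sketch of the vector-space and k-means steps, assuming scikit-learn is available (the toy tweets and K=3 are made up here purely for illustration; your pipeline would use the cleaned, enriched tweets and K=6):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up stand-ins for cleaned, enriched tweets.
tweets = [
    "new phone camera is amazing #tech",
    "loving this new smartphone #tech #gadgets",
    "great match last night #football",
    "what a goal! #football #sports",
    "trying a new pasta recipe #cooking",
    "baked sourdough bread today #cooking",
]

# Represent the tweets in a vector space (TF-IDF here).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Cluster into K groups (K=3 for this toy data; K=6 in the question).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

Each entry of `labels` is the cluster index assigned to the corresponding tweet; the open problem is turning those indices into human-readable topics.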

However, I don’t know how to automatically identify the topics associated with these clusters. Are there any solutions?

Topic social-network-analysis nlp clustering data-mining machine-learning

Category Data Science


k-means is very sensitive to noise

because it is designed as a least-squares approach. Noise deviations, when squared, become even larger.

Twitter is mostly noise

Twitter is full of spam and nonsense tweets. These will be entirely unlike any other and thus have the largest deviations.

Chances are you get one "cluster" that contains almost everything, and the other k-1 clusters consist of a few tweets with their duplicates. Clusters are not topics. They are more likely to be duplicates than topics.

An appropriate clustering algorithm for tweets should probably discard 90% of the tweets and produce thousands of clusters. But it will rarely do much better than finding tweets that share all their words in common - most tweets contain only 2-3 usable words.
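A toy numeric illustration of the noise-sensitivity point above (made-up numbers, scikit-learn assumed): with k=2, a single far-away "spam" point grabs a cluster of its own, and everything else collapses into the remaining cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two genuine groups of points, plus one extreme outlier (the "spam").
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=3.0, scale=0.5, size=(20, 2))
outlier = np.array([[100.0, 100.0]])
X = np.vstack([group_a, group_b, outlier])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The squared distance to the outlier dominates the objective, so the
# outlier sits alone and both real groups merge into one cluster.
counts = np.bincount(labels)
print(sorted(counts))
```

The cluster sizes come out as one singleton (the outlier) versus everything else - exactly the "one cluster contains almost everything" failure mode described above.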


Have you found a good approach? I am involved in the same work right now. My approach is the following:

1) Make a vector representation of all the texts in the data set, for example with the TF-IDF technique.

2) Take the first vector and put it in a pile.

3) Enter the following loop:

3a) Take the next vector and compute the cosine similarity between this vector and the centroid of each existing pile.

3b) If the highest of these cosine similarities exceeds a predefined threshold, stack this document representation on the corresponding pile. Otherwise, start a new pile with this vector.

3c) Recompute the centroid of each modified pile.

This algorithm will find groups of similar tweets, which we assume relate to the same topic.
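The steps above can be sketched roughly as follows, assuming scikit-learn for the TF-IDF step; the function name, the toy tweets, and the 0.25 threshold are illustrative choices on my part, not prescribed by the answer:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_into_piles(texts, threshold=0.25):
    # Step 1: TF-IDF vector representation of all texts.
    X = TfidfVectorizer().fit_transform(texts).toarray()

    piles = [[0]]                # Step 2: the first vector starts the first pile.
    centroids = [X[0].copy()]

    for i in range(1, len(X)):   # Step 3: loop over the remaining vectors.
        v = X[i]
        # 3a) cosine similarity between v and each pile centroid.
        sims = [
            float(v @ c) / ((np.linalg.norm(v) * np.linalg.norm(c)) or 1.0)
            for c in centroids
        ]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            # 3b) similar enough: stack on the best-matching pile...
            piles[best].append(i)
            # 3c) ...and recompute that pile's centroid.
            centroids[best] = X[piles[best]].mean(axis=0)
        else:
            # 3b) otherwise start a new pile.
            piles.append([i])
            centroids.append(v.copy())
    return piles

tweets = [
    "great goal in the football match",
    "amazing football match tonight",
    "new pasta recipe for dinner",
    "cooking pasta for dinner tonight",
]
piles = cluster_into_piles(tweets)
print(piles)
```

One pass over the data suffices, but the result depends on the input order and on the threshold, which would need tuning on real tweets.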


Basically, if I rephrase your task: you have a large collection of documents that you want to summarize. Text mining is your tool - you can choose traditional representations such as tf-idf, tf, etc. I would recommend using the holmertz technique - that framework makes things easier, as it can detect stopwords on its own, extract features, and so on. Hierarchical clustering can also work; check that you do not get obvious words as cluster centers - filtering them requires subject-matter knowledge and additional time.
