Conceptual clustering with sklearn?

How can I perform conceptual clustering in sklearn? My use case is that I have English Wikipedia articles that I'm doing unsupervised learning on (tfidf -> truncated svd -> l2 normalize), and I'd like to create a hierarchy for them such that the nodes at the top are the most general articles (e.g. Programming Languages -> Functional Languages -> Haskell).

I tried using scipy.cluster.hierarchy.linkage, but the algorithm seems to need O(n^2) space for the pairwise distances, and I ran out of memory. I also tried building a KDTree on the L2-normalized vectors and then recursively setting each node to the normalized sum of its children, but this did not produce useful results.

What is the right way to perform conceptual clustering with cosine similarity in scikit-learn without using quadratic space?

Topic unsupervised-learning scikit-learn clustering

Category Data Science


Scikit-learn does not natively support conceptual clustering. You'll have to implement conceptual clustering yourself or find another implementation.
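If you go the "implement it yourself" route, one memory-friendly starting point is recursive two-way k-means on the L2-normalized vectors. This is only a rough sketch and it is not conceptual clustering in the strict sense: it swaps in plain KMeans, relying on the fact that squared Euclidean distance between unit vectors equals 2 - 2 * cosine similarity, so it only approximates clustering by cosine. The names build_hierarchy and representative, the parameters max_depth and min_size, and the random data standing in for the SVD output are all made up for illustration. Memory stays around O(n * d) because no pairwise distance matrix is ever built.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def build_hierarchy(X, indices=None, depth=0, max_depth=3, min_size=20, seed=0):
    """Recursively split the rows of X (assumed L2-normalized) in two.

    Returns a nested dict with the member row indices, a unit-length
    centroid, and a list of child nodes (empty for leaves).
    """
    if indices is None:
        indices = np.arange(X.shape[0])
    # Node representative: the re-normalized mean direction of its members.
    centroid = normalize(X[indices].mean(axis=0, keepdims=True))[0]
    node = {"indices": indices, "centroid": centroid, "children": []}
    if depth >= max_depth or len(indices) < min_size:
        return node
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[indices])
    parts = [indices[labels == k] for k in (0, 1)]
    if min(len(p) for p in parts) == 0:
        return node  # degenerate split, keep this node as a leaf
    node["children"] = [
        build_hierarchy(X, p, depth + 1, max_depth, min_size, seed) for p in parts
    ]
    return node


def representative(node, X):
    """Row index of the member with the highest cosine similarity to the centroid."""
    members = node["indices"]
    return int(members[np.argmax(X[members] @ node["centroid"])])


# Hypothetical stand-in for the tfidf -> truncated svd -> l2 normalize output.
rng = np.random.default_rng(0)
X = normalize(rng.normal(size=(200, 16)))
tree = build_hierarchy(X, max_depth=2, min_size=10)
print(len(tree["children"]), representative(tree, X))
```

Note that the resulting tree groups articles by vector similarity only. The member closest to a node's centroid is not guaranteed to be the most general article, so the top levels may not look like "Programming Languages -> Functional Languages -> Haskell".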

If you need a hierarchy that reflects semantic knowledge, you'll have to find an existing one (for example a curated taxonomy or ontology) or create your own. Unsupervised learning on its own is generally not a reliable way to build a hierarchical structure based on semantic meaning.
