Conceptual clustering with sklearn?
How can I perform conceptual clustering in sklearn? My use case is that I have English Wikipedia articles that I'm doing unsupervised learning on (tfidf -> truncated svd -> l2 normalize), and I'd like to create a hierarchy for them such that the nodes at the top are the most general articles (e.g. Programming Languages -> Functional Languages -> Haskell).
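For concreteness, the preprocessing described above (tfidf -> truncated svd -> l2 normalize) might look roughly like the sketch below. The four toy documents and `n_components=2` are placeholders I made up for illustration, not the actual Wikipedia setup.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-in documents; the real input would be Wikipedia article text.
docs = [
    "Haskell is a purely functional programming language.",
    "OCaml is a functional programming language with type inference.",
    "Python is a general purpose programming language.",
    "Lisp is a family of programming languages.",
]

# n_components=2 is an arbitrary choice that fits this tiny corpus.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(norm="l2"),
)
X = pipeline.fit_transform(docs)

# After L2 normalization every row has unit length,
# so a plain dot product between rows is their cosine similarity.
norms = np.linalg.norm(X, axis=1)
cosine_sim = X @ X.T
```

Because the rows are unit vectors, Euclidean distance and cosine similarity rank neighbours identically, which is why Euclidean-based tools can be applied to these vectors.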
I tried using hierarchy.linkage, but it seems that the algorithm uses n^2 space, and I ran out of memory. I also tried using a KDTree on the l2-normalized vectors, and then setting each node to be the normalized sum of its children recursively, but this did not produce desirable results.
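To put the quadratic space in numbers, here is a back-of-the-envelope sketch. It assumes the linkage call first builds a condensed pairwise distance matrix of n(n-1)/2 float64 values, which is my understanding of how scipy handles raw observation vectors. The sample counts are illustrative, not my actual corpus size, and `condensed_distance_bytes` is just a helper name I made up.

```python
def condensed_distance_bytes(n_samples: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for n(n-1)/2 pairwise distances (float64 assumed by default)."""
    return n_samples * (n_samples - 1) // 2 * bytes_per_value


# Illustrative corpus sizes only.
for n in (10_000, 100_000, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} samples -> {gib:,.1f} GiB")
```

Each tenfold increase in the number of articles multiplies the memory by roughly a hundred, so the distance matrix alone rules this approach out for a large corpus.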
What is the right way to perform conceptual clustering with cosine similarity in scikit-learn without using quadratic space?
Topic unsupervised-learning scikit-learn clustering
Category Data Science