Scaling DBSCAN clustering - MinHash?

Applying density-based clustering (DBSCAN) to $50k$ data points with about $2k$-$4k$ features, I get the desired results.

However, scaling this to $10$ million data points calls for a much more efficient implementation, since DBSCAN needs $O(n^2)$ time and memory to compute the full distance matrix, which exhausts my memory.
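(For scale: a dense $10^7 \times 10^7$ distance matrix has $10^{14}$ entries, which at $8$ bytes each is on the order of $800$ TB.)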

There must be an efficient sampling-based method to overcome this, ideally something similar to MinHash, but I'm not sure how to approach it, or whether a solution exists that works with the existing sklearn DBSCAN implementation. Any ideas?



DBSCAN's runtime is $O(n)$ times the cost of a neighborhood query.

If you use an index such as LSH that can answer a neighborhood query in $O(1)$ (assuming a fairly even data distribution, with $O(1)$ neighbors per point), then DBSCAN runs in $O(n)$ overall, plus the time needed to build the index.
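For example, sklearn already lets you avoid the dense matrix entirely: precompute a sparse radius-neighborhood graph with a tree index and feed it to DBSCAN with `metric='precomputed'`. This is a minimal sketch; `X`, `eps`, and `min_samples` are placeholders you would replace with your own data and parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(100_000, 50)  # placeholder data
eps = 0.5

# A tree index answers radius queries without a dense distance matrix;
# the resulting sparse graph only stores pairs within eps of each other.
nn = NearestNeighbors(radius=eps, algorithm="ball_tree").fit(X)
neighbor_graph = nn.radius_neighbors_graph(X, mode="distance")

# DBSCAN accepts a sparse precomputed graph: absent entries are treated
# as distances larger than eps, so memory scales with the number of edges.
labels = DBSCAN(eps=eps, min_samples=10,
                metric="precomputed").fit_predict(neighbor_graph)
```

Note that with thousands of features, tree indexes degrade toward brute-force search, which is exactly why approximate hashing schemes like LSH become attractive at that scale.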

So yes, MinHash indexes can work, provided they suit your data and distance: MinHash approximates Jaccard similarity, so your points need a meaningful set representation.
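If your points can be represented as token sets and Jaccard distance fits your problem, one way to wire a MinHash LSH index into the DBSCAN loop is sketched below, assuming the `datasketch` library. The function `dbscan_lsh` and all thresholds are illustrative, and LSH neighborhoods are approximate, so this yields an approximation of exact DBSCAN rather than the same result:

```python
from collections import deque
from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    """Build a MinHash sketch of a set of string tokens."""
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

def dbscan_lsh(point_sets, threshold=0.6, min_samples=5, num_perm=128):
    """DBSCAN-style clustering where each eps-neighborhood query is
    answered approximately by a MinHash LSH index (Jaccard similarity)."""
    sketches = [minhash(s, num_perm) for s in point_sets]
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for i, m in enumerate(sketches):
        lsh.insert(i, m)

    NOISE, UNVISITED = -1, -2
    labels = [UNVISITED] * len(point_sets)
    cluster = 0
    for i in range(len(point_sets)):
        if labels[i] != UNVISITED:
            continue
        neighbors = lsh.query(sketches[i])  # approximate neighborhood, O(1)-ish
        if len(neighbors) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:                        # expand the cluster
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster         # noise reachable from a core: border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = lsh.query(sketches[j])
            if len(j_neighbors) >= min_samples:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels
```

Since each point issues at most one LSH query during expansion, the total cost stays near-linear in $n$, at the price of the usual LSH false negatives (some true neighbors within the threshold may be missed).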
